Overcoming large-scale geospatial analysis challenges using Google BigQuery Geospatial

By Brian Suk | Associate CTO, Data Engineering and Analytics

Introduction

The drive for location analytics has increased the need for geospatial analytics at scale. Connected devices, digital twin tracking, and sustainability initiatives have significantly increased the volume and velocity of captured geolocation data, stressing geospatial processing systems and their storage subsystems. There are many ways to analyze large geospatial datasets; unfortunately, many are slow or fail outright when the dataset is too large. Comparing technologies is also difficult because the geospatial space has no commonly accepted benchmarking framework (there is no “TPC-H for GIS,” so to speak). The Defence Science and Technology Laboratory (DSTL) in the United Kingdom commissioned and published a benchmark whitepaper, “Benchmarking Technologies with Geospatial Big Data,” comparing many commonly used technologies, including Geospark, Postgres, and MongoDB; it proves to be a solid starting point for creating common tests across GIS technologies.

The original DSTL benchmark used 20 n1-standard-16 Google Compute Engine (GCE) virtual machines to generate three geospatial datasets. After loading the datasets into the various databases, performance was measured by running a series of queries of differing complexity against each technology. Three of the technologies compared in the benchmark utilized the Hadoop Distributed File System (HDFS), with a five-node GCE environment performing the load to HDFS; the other technologies had various other loading mechanisms.

The DSTL whitepaper provides a detailed analysis of each technology and draws a few general conclusions:

  • Data ingestion can be computationally expensive and time-consuming
  • Infrastructure (compute, storage, monitoring, etc.) usage and management can be costly
  • Data indexing (and re-indexing in a live environment) must also be managed, as many systems rely on indexes for adequate performance

Here at SADA, with our expertise in Google Cloud Platform (GCP), we wanted to test these challenging analytics requirements against the geospatial capabilities of BigQuery by replicating the data generation, data loading, and querying benchmarks as closely as we could. Not only were we successful, but BigQuery also proved to be highly responsive and an excellent option for geospatial data analytics challenges at scale.

We acknowledge that the original DSTL benchmark was run in 2017 and that the tested technologies have since improved significantly. Even so, the benchmark provides a basis for analysis and comparison, and we welcome others executing these tests on other data technologies.

Below are the key findings and a summary of our analysis. If you would like to replicate this test, please see the technical deep-dive post, which documents and explains all the queries used.

Environment

First, it’s essential to understand BigQuery’s pricing and resourcing model (this is strictly about analysis resources; storage pricing is consistent across models). By default, BigQuery utilizes on-demand pricing, which means you simply pay for each query by the number of bytes scanned, as described in the Google documentation. Also by default, BigQuery provides up to 2,000 slots with transient bursts, which makes benchmarking unpredictable. BigQuery Reservations provides access to a consistent number of slots at fixed pricing, depending on the number of reserved slots and the duration of the commitment (by the year, month, or minute). If you decide to run these tests yourself, check whether your organization has an existing reservation you can leverage. If not, consider using flex slots (here is another great blog post giving an example of the power of flex slots). For our tests, we utilized 2,000 flex slots for consistency, repeatability, and cost optimization.
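To make the commitment model concrete, here is a minimal sketch of estimating the cost of holding a flex-slot reservation for the duration of a benchmark run. The hourly rate below is an assumed placeholder, not an authoritative figure; always check current BigQuery Reservations pricing.

```python
# Illustrative flex-slot cost estimate. The rate constant is an ASSUMPTION
# for illustration only; consult current BigQuery Reservations pricing.
HOURLY_RATE_PER_100_SLOTS = 4.00  # assumed USD per 100 slots per hour


def flex_slot_cost(slots: int, hours: float,
                   rate_per_100: float = HOURLY_RATE_PER_100_SLOTS) -> float:
    """Estimated cost of holding `slots` flex slots for `hours` hours."""
    return (slots / 100) * rate_per_100 * hours


# e.g. 2,000 slots held for a hypothetical 3-hour benchmark window:
print(round(flex_slot_cost(2000, 3), 2))  # 240.0 at the assumed rate
```

Because flex slots can be committed by the minute, you pay only for the benchmark window rather than a monthly or annual reservation.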

Data Generation

Compared to the DSTL benchmark, we chose a different approach to data generation. Instead of generating the data externally and then loading it, we explored the feasibility of generating the test data directly within BigQuery. An excellent blog post written by a Googler discusses using the faker.js framework to generate large volumes of data by running it within a BigQuery JavaScript UDF. To ensure alignment with the DSTL datasets, we needed long text and geographies, both of which are supported faker.js data types. The following table compares BigQuery’s generation times for 10 billion rows against the benchmark’s HDFS load times.

Dataset number   Time elapsed (2,000 slots)   HDFS     Time relative to BigQuery
1                68.23                        2785.9   40.83x
2                3.58                         0.7      0.2x
3                101.37                       3100.9   30.59x
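As a rough local illustration of the UDF-driven approach described above, here is the same idea sketched in Python: synthesizing rows containing long text and a random geographic point. The field names and row shape are illustrative, not the benchmark's actual schema.

```python
# Local Python sketch of what the faker.js UDF does inside BigQuery:
# fabricate rows with a long text payload and a random WKT point.
import random
import string


def random_row(rng: random.Random) -> dict:
    """Build one synthetic row; schema here is illustrative only."""
    text = "".join(rng.choices(string.ascii_lowercase + " ", k=200))  # long text
    lon = rng.uniform(-180.0, 180.0)
    lat = rng.uniform(-90.0, 90.0)
    return {
        "payload": text,
        "lon": lon,
        "lat": lat,
        "point": f"POINT({lon:.6f} {lat:.6f})",  # WKT, parseable by ST_GEOGFROMTEXT
    }


rng = random.Random(42)  # seeded for repeatability
rows = [random_row(rng) for _ in range(1000)]
print(len(rows))  # 1000
```

Inside BigQuery, the equivalent generator runs as a JavaScript UDF invoked across billions of rows, so generation parallelizes with the slot count rather than running on a single machine.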

Note that the HDFS times cover only loading pre-generated data and do not include data generation itself, whereas the BigQuery times include generation. HDFS was slightly faster at loading dataset 2, but BigQuery was significantly faster on the larger datasets.
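For clarity, the "Time relative to BigQuery" column is simply the other system's elapsed time divided by BigQuery's, using the figures from the table above:

```python
# Derive the relative-time multipliers from the table's raw timings.
bigquery_minutes = {1: 68.23, 2: 3.58, 3: 101.37}
hdfs_minutes = {1: 2785.9, 2: 0.7, 3: 3100.9}

relative = {ds: round(hdfs_minutes[ds] / bigquery_minutes[ds], 2)
            for ds in bigquery_minutes}
print(relative)  # {1: 40.83, 2: 0.2, 3: 30.59}
```

A value below 1x means HDFS loading beat BigQuery's combined generate-and-store time, as it did on the small dataset 2.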

Data querying

Below are the BigQuery results alongside the DSTL findings from pages 44 through 56 of the white paper. All times are in minutes; entries of 0 indicate that no result was reported for that technology. Note how many times longer each technology took compared to BigQuery.

Dataset 1 – Query 1

Benchmark query: select * from dataset1 where dataset1.geo is within bbox1

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.72            1x
Hive            27.85           38.68x
Geospark        89.63           124.49x
MongoDB         282.5           392.36x
Elasticsearch   55.69           77.35x
GeoMesa         0.77            1.07x
Postgres-XL     0.77            1.07x
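The benchmark queries are written in pseudo-SQL. As a sketch only (the deep-dive post documents the exact queries we ran), here is one plausible way the bounding-box filter could be rendered with BigQuery's geography functions; the project/table names and bbox coordinates are placeholders:

```python
# Build a plausible BigQuery rendering of the "is within bbox1" pseudo-query.
# Table name and bbox are placeholders, not the benchmark's actual values.
bbox1_wkt = "POLYGON((-1.0 50.0, -1.0 51.0, 1.0 51.0, 1.0 50.0, -1.0 50.0))"

sql = f"""
SELECT *
FROM `my_project.benchmark.dataset1`   -- placeholder table name
WHERE ST_INTERSECTS(geo, ST_GEOGFROMTEXT('{bbox1_wkt}'))
""".strip()

print("ST_INTERSECTS" in sql)  # True
```

ST_INTERSECTS and ST_GEOGFROMTEXT are standard BigQuery geography functions; no spatial index needs to be created or maintained for them to perform well, which is one of the operational differences called out in the conclusions above.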

Dataset 1 – Query 2

Benchmark query: select * from dataset1 where dataset1.geo is within bbox2

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.48            1x
Hive            27.5            57.29x
Geospark        86.74           180.71x
MongoDB         0               0x
Elasticsearch   3.29            6.85x
GeoMesa         0.12            0.25x
Postgres-XL     0               0x

Dataset 1 – Query 3

Benchmark query: select * from dataset1 where dataset1.timestamp < time1 and dataset1.timestamp > time2

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        2.08            1x
Hive            44.09           21.2x
Geospark        62.27           29.94x
MongoDB         0               0x
Elasticsearch   47.8            22.98x
GeoMesa         344.67          165.71x
Postgres-XL     0               0x

Dataset 1 – Query 4

Benchmark query: select * from dataset1 where dataset1.geo is within bbox1 and dataset1.timestamp < time1 and dataset1.timestamp > time2

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        1.08            1x
Hive            28.11           26.03x
Geospark        116.69          108.05x
MongoDB         71.68           66.37x
Elasticsearch   44.61           41.31x
GeoMesa         0.69            0.64x
Postgres-XL     0               0x

Dataset 1 – Query 5

Benchmark query: select * from dataset1 where dataset1.geo is within bbox2 and dataset1.timestamp < time1 and dataset1.timestamp > time2

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.42            1x
Hive            28              66.67x
Geospark        105.59          251.4x
MongoDB         0               0x
Elasticsearch   1.21            2.88x
GeoMesa         0.45            1.07x
Postgres-XL     0               0x

Dataset 3 – Query 1

Benchmark query: select * from dataset3 where dataset3.geo is within bbox2 and dataset3.timestamp < time1 and dataset3.timestamp > time2

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.58            1x
Hive            99.84           172.14x
Geospark        159.35          274.74x
MongoDB         0               0x
Elasticsearch   0               0x
GeoMesa         2.17            3.74x
Postgres-XL     0               0x

Dataset 3 – Query 2

Benchmark query: select * from dataset3 where dataset3.geo is within 10km of any point in dataset2.geo

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.3             1x
Hive            97.8            326x
Geospark        110.54          368.47x
MongoDB         0               0x
Elasticsearch   0               0x
GeoMesa         0.13            0.43x
Postgres-XL     0               0x

Dataset 3 – Query 3

Benchmark query: select * from dataset3 where dataset3.geo intersects bbox1 and dataset3.timestamp < time1 and dataset3.timestamp > time2

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.77            1x
Hive            101.46          131.77x
Geospark        98.23           127.57x
MongoDB         0               0x
Elasticsearch   0               0x
GeoMesa         3.28            4.26x
Postgres-XL     0               0x

Dataset 3 – Query 4

Benchmark query: select * from dataset3 where dataset3.geo intersects bbox2 and dataset3.timestamp < time1 and dataset3.timestamp > time2

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.27            1x
Hive            100.35          371.67x
Geospark        114.85          425.37x
MongoDB         0               0x
Elasticsearch   0               0x
GeoMesa         1.04            3.85x
Postgres-XL     0               0x

Dataset 1 – Query 9

Benchmark query: select * from dataset1 where dataset1.geo is within 10km of point1, order by distance to point1

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.07            1x
Hive            28.22           403.14x
Geospark        85.06           1215.14x
MongoDB         0.54            7.71x
Elasticsearch   0.39            5.57x
GeoMesa         1.04            14.86x
Postgres-XL     0               0x
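A local sketch of Query 9's semantics (points within 10 km of point1, ordered by distance) may help make the workload concrete. In BigQuery this maps naturally to ST_DWITHIN and ST_DISTANCE; here the same logic is illustrated in plain Python with a haversine distance. The points and point1 are made-up values.

```python
# Local illustration of "within 10km of point1, order by distance to point1".
from math import radians, sin, cos, asin, sqrt


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in kilometres."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


point1 = (0.0, 51.5)                                   # illustrative lon/lat
points = [(0.01, 51.5), (0.5, 51.5), (0.05, 51.52)]    # illustrative dataset

# Filter to within 10 km, then order by distance to point1.
within = sorted(
    (p for p in points if haversine_km(*point1, *p) <= 10.0),
    key=lambda p: haversine_km(*point1, *p),
)
print(within)  # [(0.01, 51.5), (0.05, 51.52)]
```

The point at longitude 0.5 is roughly 35 km away and drops out; the remaining two are returned nearest-first, which is exactly what the ORDER BY on distance expresses.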

Dataset 1 – Query 10

Benchmark query: select * from dataset1 where dataset1.geo is within 10km of point1, order by timestamp

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.07            1x
Hive            27.98           399.71x
Geospark        83.79           1197x
MongoDB         0               0x
Elasticsearch   0.53            7.57x
GeoMesa         0.88            12.57x
Postgres-XL     0               0x

Dataset 3 – Query 5

Benchmark query: select * from dataset3 where dataset3.geo is within 10km of point1, order by closest distance to point1

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.23            1x
Hive            104.72          455.3x
Geospark        93.22           405.3x
MongoDB         0               0x
Elasticsearch   0               0x
GeoMesa         0.88            3.83x
Postgres-XL     0               0x

Dataset 3 – Query 6

Benchmark query: select * from dataset3 where dataset3.geo is within 10km of point1, order by timestamp

Technology      Runtime (min)   Multiplier to BigQuery
BigQuery        0.25            1x
Hive            103.94          415.76x
Geospark        101.11          404.44x
MongoDB         0               0x
Elasticsearch   0               0x
GeoMesa         0.82            3.28x
Postgres-XL     0               0x

Join queries

These queries were interesting because all of the benchmarked technologies failed to complete them (as shown on page 56 of the report). Here are the two queries:

Join query 1: select * from dataset1 where dataset1.geo is within 10km of any point in dataset2.geo
Join query 2: select * from dataset3 where dataset3.geo is within 10km of any point in dataset2.geo

The technical deep-dive blog discusses the performance tuning required to execute these joins in BigQuery; the timings (in minutes) are as follows:

Join queries    Runtime (min)
Join query 1    2.37
Join query 2    3.07
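The semantics these joins express — keep every row of one dataset whose point lies within 10 km of *any* point in dataset2 — can be sketched in miniature as follows. All coordinates are made-up values; in BigQuery the same predicate is written with ST_DWITHIN inside a join condition, and the deep-dive post covers the tuning needed to make it scale.

```python
# Tiny in-memory sketch of the join queries' "within 10km of any point" logic.
from math import radians, sin, cos, asin, sqrt


def haversine_km(a, b):
    """Great-circle distance between two (lon, lat) tuples, in kilometres."""
    (lon1, lat1), (lon2, lat2) = a, b
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


dataset1 = [(0.0, 51.5), (10.0, 51.5), (0.05, 51.52)]  # illustrative points
dataset2 = [(0.01, 51.49)]                             # illustrative points

# Keep dataset1 rows within 10 km of ANY dataset2 point.
matches = [p for p in dataset1 if any(haversine_km(p, q) <= 10.0 for q in dataset2)]
print(matches)  # [(0.0, 51.5), (0.05, 51.52)]
```

Evaluated naively this is an all-pairs comparison, which is why it broke every engine in the original benchmark at 10-billion-row scale; pruning candidate pairs is the heart of the tuning discussed in the deep-dive post.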

Conclusion

Overall, BigQuery was significantly faster than the technologies in the DSTL benchmark. Additionally, BigQuery eliminates infrastructure and management costs along with database index maintenance. For more information about BigQuery’s other benefits, refer to Google’s BigQuery overview.

If you would like to recreate this exercise on your own in BigQuery, or would like to see the underlying engineering processes so you can recreate it in another technology, please see this external blog post, which goes through all the details of both data generation and benchmark querying.
