GeoBrix is a high-performance spatial processing library. Its heavy-weight readers and functions are powered by GDAL, implemented on Apache Spark, and built to run exclusively on the Databricks Runtime (DBR); see the docs for more.
Now that the product's built-in Spatial SQL Functions have reached Public Preview as of DBR 17.1, we are seeking to deliver the next generation of product-augmenting capabilities to help our customers. The GeoBrix project is a streamlined iteration of the existing, and quite popular, DBLabs Mosaic project. Beyond just porting existing Mosaic code, GeoBrix is modernized with expressions designed to work with our Data Intelligence Platform. GeoBrix will be a combination of heavy-weight (e.g. JAR) as well as lightweight (e.g. Python, SQL) code artifacts. It will also focus on techniques that use the Databricks platform more broadly.
With Databricks first having acquired MosaicML and since having built a product line, Mosaic AI, it became clear that the DBLabs Mosaic project, which shares the name, needed to be revamped: renamed, and stripped of any existing Mosaic capabilities that compete with product investments. Were that not the case, we would have simply iterated on DBLabs Mosaic "in-place", keeping the same name for what is now called GeoBrix. DBLabs Mosaic is in maintenance mode. The latest (and last) version of Mosaic targets DBR 13.3 LTS, since product ST functions were introduced starting with DBR 14. As such, Mosaic has no awareness of advancements in recent runtimes, including product support for Spatial SQL and types, and will be retired with DBR 13.3 EoS in August 2026.
GeoBrix offers heavy-weight packages for Raster, Grid, and Vector that are intended to augment and complement ongoing Databricks product initiatives.
Refactor and improvement of Mosaic raster functions. Product does not (yet) support anything built-in specifically for raster, so this is a "fully" gap-filling capability.
Refactor of Mosaic discrete global grid indexing functions. Focus has been on porting BNG (British National Grid) for Great Britain customers.
Refactor of select DBLabs Mosaic vector functions that augment the existing product ST Geospatial Functions. Right now, this includes only a single function for updating existing Mosaic geometry data to the formats supported by product, so that users do not need to install (older) Mosaic in order to start using the latest spatial features.
The following Spark readers are automatically registered when the JAR is on the classpath.
- "sizeInMB" → defaults to "16" - split the file if over the threshold
- "filterRegex" → defaults to ".*" - filter loaded files from the provided directory
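As a hedged illustration of the "filterRegex" option (assuming the pattern is matched against file names; the exact matching rule is GeoBrix's to define), the default ".*" keeps everything, while a narrower pattern restricts what gets loaded:

```python
import re

# Hypothetical directory listing; GeoBrix applies "filterRegex" when scanning the path.
files = ["a.tif", "b.tif", "notes.txt", "c.tiff"]

def keep(pattern, names):
    # Assumption for illustration: full-match semantics against the bare file name.
    return [n for n in names if re.fullmatch(pattern, n)]

print(keep(r".*", files))         # default pattern keeps every file
print(keep(r".*\.tiff?", files))  # narrow the load to TIFF-style extensions
```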
We are really only focused on GeoTIFFs right now, but you are free to try loading any available driver with something like:
(
  spark
    .read.format("gdal")
    .option("driverName", "<driver>")  # if not provided, extension is used to detect
    .load("<path>")
)
The following are available and call the "gdal" reader with some options explicitly set.
Read GeoTIFF raster files - the most common geospatial raster format. This is a named GDAL Reader; sets "driverName" → "GTiff":
- GDAL auto-associates GeoTiff and BigTiff extensions to this driver, e.g. .tif files
- With the named reader, the driver is specified to be used regardless of extension
- Can use the other available "gdal" reader options
(
  spark
    .read.format("gtiff_gdal")
    .load("<path_to_supported_files>")
)
The output will look something like the following, with the tile column ready to use with other RasterX APIs.
- "driverName" → if not provided, GDAL uses a best guess based on file extension
- "chunkSize" → default "10000" - number of records per chunk when multi-threading reads within a file
- "layerN" → default "0" - for file formats that use layers
- "layerName" → default "" - for file formats that use layers
- "asWKB" → default "true" - whether to return geometry results as WKB or WKT
Note: for the Beta, results are not converted to Databricks' new native spatial types GEOMETRY / GEOGRAPHY, so conversion is an additional step once the data has been read.
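Converting the WKB output to the native GEOMETRY type should be a single expression in runtimes that ship the spatial preview (e.g. an ST_GeomFromWKB-style function; check your DBR version's function list). For intuition about what a WKB-producing reader hands back, here is a minimal sketch of the WKB byte layout for a 2-D point, built by hand with the standard library:

```python
import struct

def wkb_point(x: float, y: float) -> bytes:
    # byte-order flag 1 = little-endian, geometry type 1 = Point,
    # followed by two IEEE-754 float64 coordinates
    return struct.pack("<BIdd", 1, 1, x, y)

# POINT (1 2) as hex - the same shape a WKB geometry column carries
print(wkb_point(1.0, 2.0).hex())
```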
The following are available and call the "ogr" reader with some options explicitly set.
This is a named OGR Reader; sets "driverName" → "ESRI Shapefile":
- GDAL auto-associates the following extensions to this driver: .shz files (ZIP files containing the .shp, .shx, .dbf and other side-car files of a single layer) and .shp.zip files (ZIP files containing one or several layers)
- With the named reader, GDAL can additionally handle .zip files (ZIP files containing one or several layers)
(
  spark
    .read.format("shapefile_ogr")
    .load("<path_to_supported_files>")
)
The output will look something like the following, maintaining attribute columns and having 3 columns for geometry: 'geom_0', 'geom_0_srid', and 'geom_0_srid_proj'.
This is a named OGR Reader.
- "multi" → default "true"
  - when "true", sets "driverName" → "GeoJSONSeq"
  - when "false", sets "driverName" → "GeoJSON"
(
  spark
    .read.format("geojson_ogr")
    .option("multi", "false")  # if not provided, "true" assumed
    .load("<path_to_supported_files>")
)
The output will look something like the following, maintaining attribute columns and having 3 columns for geometry: 'geom_0', 'geom_0_srid', and 'geom_0_srid_proj'.
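The reason the "multi" option swaps driver names: GeoJSONSeq is newline-delimited (one complete feature per line, which splits nicely for parallel reads), while plain GeoJSON is a single document such as a FeatureCollection. A small sketch of the structural difference (plain Python, no Spark required):

```python
import json

# GeoJSONSeq framing: one complete feature object per line
seq_text = (
    '{"type": "Feature", "properties": {"id": 1}, "geometry": null}\n'
    '{"type": "Feature", "properties": {"id": 2}, "geometry": null}\n'
)
seq_features = [json.loads(line) for line in seq_text.splitlines()]

# GeoJSON framing: one document wrapping all features
doc_text = json.dumps({"type": "FeatureCollection", "features": seq_features})
doc_features = json.loads(doc_text)["features"]

print(len(seq_features), len(doc_features))  # same features, different framing
```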
This is a named OGR Reader; sets "driverName" → "GPKG".
(
  spark
    .read.format("gpkg_ogr")
    .load("<path_to_supported_files>")
)
The output will look something like the following, maintaining attribute columns and having 3 columns for geometry: 'shape', 'shape_srid', and 'shape_srid_proj'.
This is a named OGR Reader; sets "driverName" → "OpenFileGDB". You may also want to specify the "layerN" or "layerName" option to read the desired layer.
(
  spark
    .read.format("file_gdb_ogr")
    .load("<path_to_supported_files>")
)
The output will look something like the following, maintaining attribute columns and having 3 columns for geometry: 'SHAPE', 'SHAPE_srid', and 'SHAPE_srid_proj'. Note: column names are case insensitive.
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
GeoBrix currently offers heavy-weight, distributed APIs, primarily written in Scala for Spark with additional language bindings for PySpark and Spark SQL. See docs for more information on installing and using available readers and functions.
Cluster Config
GeoBrix requires GDAL natives, which are best installed via an init script on a classic cluster:
1. Add the GeoBrix JAR and shared object ('*.so') to the Volume - currently these are delivered via Release artifacts.
2. Add geobrix-gdal-init.sh to a chosen Databricks Volume; note: prior to copying, modify 'VOL_DIR' to the location of the artifacts in (1).
3. Add the WHL as a cluster library.
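The exact init script ships with the Release artifacts; as a hedged sketch only (the package names, target paths, and Volume layout below are assumptions, not the released script), such an init script typically installs the GDAL natives and exposes the shared object to the JVM:

```shell
#!/bin/bash
# Hypothetical sketch of a GDAL init script; use the released
# geobrix-gdal-init.sh in practice - this is illustrative only.
set -euo pipefail

# Location of the uploaded artifacts (edit to match step 1)
VOL_DIR="/Volumes/<catalog>/<schema>/<volume>/geobrix"

# Install GDAL natives from the OS package manager
# (the required GDAL version depends on the DBR base image)
apt-get update -y && apt-get install -y gdal-bin libgdal-dev

# Make the GeoBrix shared object visible to the JVM
cp "${VOL_DIR}"/*.so /usr/lib/
ldconfig
```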
To get up and running with PySpark bindings and SQL function registration in a cluster, execute the following (note: you do not need to do this if you are just using the included readers):
from databricks.labs.gbx.rasterx import functions as rx
rx.register(spark)
You can quickly list the registered functions with a SQL command.
%sql
-- hint: you can sort the return column
show functions like 'gbx_rst_*'
Describe any registered function for more details.
%sql describe function extended gbx_rst_boundingbox
The heavy-weight API is written in Scala with various Spark optimizations implemented following best practices, including using Spark Connect to invoke the columnar expressions. The pattern for registering functions is com.databricks.labs.gbx.<category>.functions, where 'gbx' is the convention for GeoBrix in classpaths:
- VectorX
  import com.databricks.labs.gbx.vectorx.{functions => vx}
  vx.register(spark)
  vx.<function>
  - sample function: vx.st_legacyaswkb
- GridX (showing BNG)
  import com.databricks.labs.gbx.gridx.bng.{functions => bx}
  (same registration and execution pattern as VectorX)
  - sample function: bx.bng_cellarea
- RasterX
  import com.databricks.labs.gbx.rasterx.{functions => rx}
  (same registration and execution pattern as VectorX)
  - sample function: rx.rst_clip
The Python bindings are a lightweight wrapper over the underlying Scala columnar expressions via Spark Connect. Functions are registered in a similar manner as with Scala:
- VectorX
  from databricks.labs.gbx.vectorx import functions as vx
  vx.register(spark)
  vx.<function>
- GridX (showing BNG)
  from databricks.labs.gbx.gridx.bng import functions as bx
  (same registration and execution pattern as VectorX)
- RasterX
  from databricks.labs.gbx.rasterx import functions as rx
  (same registration and execution pattern as VectorX)
All GeoBrix SQL functions are registered with the gbx_ prefix. This reflects a lesson learned from previous experience: functions registered without a prefix are unattributable to any particular provider on classic compute (e.g. you cannot tell whether an st_ call on classic compute comes from product or from Sedona), whereas usage is easily attributable to GeoBrix when gbx_st_ is invoked:
- Sample vector expression: gbx_st_legacyaswkb
- Sample grid expression: gbx_bng_cellarea
- Sample raster expression: gbx_rst_clip
See the scripts folder for more information.
- The Beta does not yet support Databricks Spatial Types directly but is standardized to WKB or WKT where geometries are involved. In addition to content in the user guide, the provided notebooks, e.g. Shapefile Reader, have examples of converting to our built-in GEOMETRY type and using our built-in ST Geospatial Functions.
- A handful of functions are not yet ported. For raster: rst_dtmfromgeoms; for vector: st_interpolateelevation and st_triangulate.
- Spatial KNN is not yet ported; neither is H3 support for Geometry-based K-Ring and K-Loop.
- Custom Gridding is not fully ported.