Regridding: Comparing xarray and SciPy

While looking for ways to regrid data in code rather than relying on standalone tools, I came across two popular options: SciPy and xarray.

To explore and compare their capabilities, I downloaded a single day of 2-meter temperature data (T2m) from the ERA5 reanalysis. The original data had a resolution of about 25 km (0.25°), and I regridded it to a finer resolution of about 1 km (0.01°) using both methods. Below, I discuss the process and findings.

What is Regridding?

Regridding is the process of interpolating data from one spatial grid to another. Interpolation, in turn, is the process of estimating unknown values within a range of known data points. Regridding is often required when working with datasets from different sources or when a finer resolution is needed for specific applications like climate modeling or geographic analysis.
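To make the idea of interpolation concrete, here is a minimal one-dimensional sketch. The longitudes and temperatures are made-up values for illustration, not taken from the ERA5 file.

```python
import numpy as np

# Known temperatures (in K) at three coarse grid points along one latitude.
# These numbers are invented for illustration only.
coarse_lons = np.array([70.0, 70.25, 70.5])
coarse_t2m = np.array([300.0, 301.0, 303.0])

# Finer target longitudes, some of which fall between the known points
fine_lons = np.array([70.0, 70.125, 70.25, 70.375, 70.5])

# Linear interpolation estimates the values in between the known points
fine_t2m = np.interp(fine_lons, coarse_lons, coarse_t2m)
print(fine_t2m)  # the two midpoints come out as 300.5 and 302.0
```

Regridding applies the same idea in two dimensions, across latitude and longitude at once.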

Dataset

The dataset I used contains T2m data at 25 km resolution over India, downloaded from the ERA5 reanalysis. The original file is in NetCDF format, and I defined new latitude and longitude bounds (latitude 4 to 40 and longitude 68 to 98 in the code below, to cover India) to create a target grid with a resolution of 1 km.
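The kilometre figures are approximations of the grid spacing in degrees (0.25° for ERA5, 0.01° for the target grid). The sketch below shows the rough conversion and one way to build the target latitude axis. The constant of 111 km per degree is an approximation, and np.linspace is used here only as an alternative to the np.arange call in the code further down, because it fixes the number of points explicitly.

```python
import numpy as np

KM_PER_DEGREE_LAT = 111.0  # rough length of one degree of latitude, in km

# Approximate north-south spacing of the source and target grids
for step_deg in (0.25, 0.01):
    print(f"{step_deg} deg is roughly {step_deg * KM_PER_DEGREE_LAT:.1f} km")

# Target latitude axis from 4 to 40 degrees at 0.01-degree spacing
lat_min, lat_max, step = 4.0, 40.0, 0.01
n_lats = int(round((lat_max - lat_min) / step)) + 1  # both endpoints included
new_lats = np.linspace(lat_min, lat_max, n_lats)
print(n_lats, new_lats[0], new_lats[-1])
```

East-west spacing in kilometres shrinks with the cosine of the latitude, so "1 km" is only a nominal label for a 0.01° grid.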

‘t2m’ for a single day (ERA5 reanalysis):

xarray & SciPy

xarray is a Python library for working with labeled multi-dimensional arrays, offering powerful tools for analysis and metadata handling, especially with NetCDF files.

SciPy is a scientific computing library for Python, providing numerical tools like interpolation (griddata) for irregular or regular grid data.

Methods Used by xarray & SciPy

Both xarray and scipy.interpolate offer methods such as linear, nearest, and cubic. Let's see what they mean:

1. Linear Interpolation (Method Name: ‘linear’): Linear interpolation calculates the value at an unknown point by linearly combining the values of surrounding points. It assumes the data forms straight lines or planes between points, resulting in a continuous surface that is piecewise linear, with visible creases where the pieces meet.

2. Nearest Neighbor Interpolation (Method Name: ‘nearest’): Nearest neighbor interpolation assigns the value of the closest data point to the unknown point. It creates a blocky or stepped appearance in the output, as no averaging or smoothing occurs.

3. Cubic Interpolation (Method Name: ‘cubic’): Cubic interpolation uses cubic polynomials to estimate values at unknown points, producing smoother surfaces than linear interpolation. Keep in mind that it considers more surrounding points, which can lead to overshooting in some cases.

In my tests, the quality of the regridded output ranked as follows: Cubic > Linear > Nearest
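To get a feel for how the three methods behave, the sketch below runs scipy.interpolate.griddata with each of them on a small synthetic field. The 5x5 source grid, the 21x21 target grid, and the sine/cosine field are all assumptions made for illustration; they are not the ERA5 data.

```python
import numpy as np
from scipy.interpolate import griddata

# Coarse 5x5 grid carrying a synthetic field (made-up values)
x = np.linspace(0.0, 1.0, 5)
xx, yy = np.meshgrid(x, x)
points = np.column_stack([xx.ravel(), yy.ravel()])
values = np.sin(2 * np.pi * xx.ravel()) * np.cos(np.pi * yy.ravel())

# Finer 21x21 target grid covering the same area
xf = np.linspace(0.0, 1.0, 21)
xxf, yyf = np.meshgrid(xf, xf)

# Interpolate the same data with each method
results = {
    method: griddata(points, values, (xxf, yyf), method=method)
    for method in ("nearest", "linear", "cubic")
}

for method, field in results.items():
    # Few distinct values means a blocky result; many means a smoother one
    n_unique = np.unique(np.round(field, 6)).size
    print(f"{method:8s} min={np.nanmin(field):+.3f} "
          f"max={np.nanmax(field):+.3f} distinct values={n_unique}")
```

Nearest only ever returns values that already exist in the source data, and linear stays within the range of the source values. Cubic has no such guarantee, which is where overshooting can come from.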

Procedure to Regrid:

Using xarray:

1. Use xarray.open_dataset() to load the NetCDF file containing the data.

2. Create arrays for the new latitude and longitude with the desired resolution.

3. Use Dataset.interp() to interpolate data to the new grid.
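The xarray steps above can be sketched on a small synthetic dataset. This is a minimal illustration, assuming an in-memory dataset in place of the NetCDF file (so step 1 is replaced by building the Dataset by hand); the variable name t2m and the planar field are stand-ins.

```python
import numpy as np
import xarray as xr

# Stand-in for step 1: a coarse 0.25-degree dataset built in memory
lats = np.linspace(10.0, 12.0, 9)
lons = np.linspace(70.0, 72.0, 9)
field = 300.0 + 0.5 * lats[:, None] - 0.2 * lons[None, :]  # synthetic plane
ds = xr.Dataset(
    {"t2m": (("latitude", "longitude"), field)},
    coords={"latitude": lats, "longitude": lons},
)

# Step 2: new coordinate arrays at the finer resolution (0.05 degrees here)
new_lats = np.linspace(10.0, 12.0, 41)
new_lons = np.linspace(70.0, 72.0, 41)

# Step 3: interpolate every variable in the dataset onto the new grid
fine = ds.interp(latitude=new_lats, longitude=new_lons, method="linear")
print(fine["t2m"].shape)
```

The result is still a labeled Dataset, so it can be plotted or written with to_netcdf() without any further conversion.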

Using SciPy:

1. Use xarray.open_dataset() to load the dataset and extract numpy arrays for the data, latitudes, and longitudes.

2. Create a 1D array of coordinate pairs and the corresponding data values.

3. Create a new 2D grid for the desired latitude and longitude resolution.

4. Use scipy.interpolate.griddata() to interpolate the data.

5. Convert the regridded data back into an xarray Dataset for saving and plotting.
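The SciPy steps above can be sketched in the same way. Again this is a minimal illustration on synthetic arrays rather than the ERA5 file, and the variable name t2m is a stand-in.

```python
import numpy as np
import xarray as xr
from scipy.interpolate import griddata

# Step 1 (stand-in): plain numpy arrays for the data and its coordinates
lats = np.linspace(10.0, 12.0, 9)
lons = np.linspace(70.0, 72.0, 9)
data = 300.0 + 0.5 * lats[:, None] - 0.2 * lons[None, :]  # synthetic plane

# Step 2: flatten into (lon, lat) coordinate pairs and matching values
lon2d, lat2d = np.meshgrid(lons, lats)
points = np.column_stack([lon2d.ravel(), lat2d.ravel()])
values = data.ravel()

# Step 3: new 2D grid at the finer resolution
new_lats = np.linspace(10.0, 12.0, 41)
new_lons = np.linspace(70.0, 72.0, 41)
new_lon2d, new_lat2d = np.meshgrid(new_lons, new_lats)

# Step 4: interpolate the scattered points onto the new grid
regridded = griddata(points, values, (new_lon2d, new_lat2d), method="linear")

# Step 5: wrap the bare array back into an xarray Dataset
regridded_ds = xr.Dataset(
    {"t2m": (("latitude", "longitude"), regridded)},
    coords={"latitude": new_lats, "longitude": new_lons},
)
print(regridded_ds["t2m"].shape)
```

Note that np.meshgrid(lons, lats) returns arrays of shape (n_lat, n_lon), which matches the layout of the data array, so the flattened points and values stay aligned.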

Code:

1. SciPy regridding:
import xarray as xr
import numpy as np
import geopandas as gpd
import regionmask
import os
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

input_dir = 'era5_data/test/'  
output_dir = 'era5_data/test/' 
os.makedirs(output_dir, exist_ok=True)

lat_bounds = (4, 40)
lon_bounds = (68, 98)
new_lats = np.arange(lat_bounds[0], lat_bounds[1] + 0.01, 0.01)
new_lons = np.arange(lon_bounds[0], lon_bounds[1] + 0.01, 0.01)
lon_grid, lat_grid = np.meshgrid(new_lons, new_lats)

for f in os.listdir(input_dir):
    if f.endswith('.nc'):
        file_path = os.path.join(input_dir, f)
        print(f"Processing file for regridding: {file_path}")
        if file_path == 'era5_data/test/1dec2024_2mtemp_era5.nc':
            ds = xr.open_dataset(file_path)
            clipped_ds = ds.sel(latitude=slice(lat_bounds[1], lat_bounds[0]), 
                                longitude=slice(lon_bounds[0], lon_bounds[1]))

            # gdf = gpd.read_file(india_shapefile_path)
            # mask = regionmask.mask_geopandas(gdf, clipped_ds.longitude, clipped_ds.latitude)
            # mask = xr.DataArray(np.logical_not(mask), coords=mask.coords, dims=mask.dims)
            # masked_ds = clipped_ds.where(mask)
            masked_ds = clipped_ds
            variable_name = list(masked_ds.data_vars.keys())[0]
            da = masked_ds[variable_name]
            lats = masked_ds.latitude.values
            lons = masked_ds.longitude.values
            # The time dimension in this file is called 'valid_time', not 'time'
            has_time = 'valid_time' in da.dims
            time = masked_ds.valid_time.values if has_time else None
            if has_time:
                # Put time last so that each column of `values` is one time step
                da = da.transpose('latitude', 'longitude', 'valid_time')
            data = da.values
            # Flatten the source grid into (lon, lat) pairs for griddata
            lon_flat, lat_flat = np.meshgrid(lons, lats)
            points = np.array([lon_flat.flatten(), lat_flat.flatten()]).T
            values = data.reshape(-1, data.shape[-1]) if has_time else data.flatten()

            method = 'cubic'  # one method for both branches, matching the plot title
            if has_time:
                # Interpolate each time step separately
                regridded_data = []
                for t_idx in range(data.shape[-1]):
                    interp_data = griddata(points, values[:, t_idx], (lon_grid, lat_grid), method=method)
                    regridded_data.append(interp_data)
                regridded_data = np.stack(regridded_data, axis=-1)
            else:
                regridded_data = griddata(points, values, (lon_grid, lat_grid), method=method)

            # Wrap the result back into a Dataset; add valid_time only if it exists
            dims = ['latitude', 'longitude', 'valid_time'] if has_time else ['latitude', 'longitude']
            coords = {'latitude': new_lats, 'longitude': new_lons}
            if has_time:
                coords['valid_time'] = time
            regridded_ds = xr.Dataset({variable_name: (dims, regridded_data)}, coords=coords)
            # output_nc_path = os.path.join(output_dir, f"regridded_{f}")
            # regridded_ds.to_netcdf(output_nc_path)
            # print(f"Regridded data saved to {output_nc_path}")
            plt.figure(figsize=(12, 8))
            regridded_ds[variable_name].mean(dim='valid_time').plot(cmap='viridis') if time is not None else \
                regridded_ds[variable_name].plot(cmap='viridis')
            plt.title(f'SciPy Cubic: Regridded {variable_name} Data at 0.01° Resolution')
            plt.xlabel('Longitude')
            plt.ylabel('Latitude')
            plot_path = os.path.join(output_dir, f"scipy_cubic_regridded_{f.replace('.nc', '.png')}")
            plt.savefig(plot_path, dpi=600)
            plt.show()
            plt.close()
            print(f"Plot saved to {plot_path}")

2. xarray regridding:

#file = '1dec2024_2mtemp_era5.nc'
import xarray as xr
import numpy as np
import geopandas as gpd
import regionmask
import os
import matplotlib.pyplot as plt

input_dir = 'era5_data/test/' 
output_dir = 'era5_data/test/' 
#india_shapefile_path = 'india_shp/india.shp'
os.makedirs(output_dir, exist_ok=True)

lat_bounds = (4, 40)
lon_bounds = (68, 98)
new_lats = np.arange(lat_bounds[0], lat_bounds[1] + 0.01, 0.01)
new_lons = np.arange(lon_bounds[0], lon_bounds[1] + 0.01, 0.01)

target_grid = xr.Dataset(
    {
        "latitude": (["latitude"], new_lats),
        "longitude": (["longitude"], new_lons),
    }
)

for f in os.listdir(input_dir):
    if f.endswith('.nc'):
        file_path = os.path.join(input_dir, f)
        print(f"Processing file for regridding: {file_path}")
        if file_path == 'era5_data/test/1dec2024_2mtemp_era5.nc':
            ds = xr.open_dataset(file_path)
            clipped_ds = ds.sel(latitude=slice(lat_bounds[1], lat_bounds[0]), 
                                longitude=slice(lon_bounds[0], lon_bounds[1]))

            # gdf = gpd.read_file(india_shapefile_path)
            # mask = regionmask.mask_geopandas(gdf, clipped_ds.longitude, clipped_ds.latitude)
            # mask = xr.DataArray(np.logical_not(mask), coords=mask.coords, dims=mask.dims)
            # masked_ds = clipped_ds.where(mask)
            masked_ds = clipped_ds
            regridded_ds = masked_ds.interp(
                latitude=target_grid.latitude,
                longitude=target_grid.longitude,
                method='nearest'
            )

            # output_nc_path = os.path.join(output_dir, f"regridded_{f}")
            # regridded_ds.to_netcdf(output_nc_path)
            # print(f"Regridded data saved to {output_nc_path}")

            variable_name = list(regridded_ds.data_vars.keys())[0]  
            # plt.figure(figsize=(12, 8))
            regridded_ds[variable_name].mean(dim='valid_time').plot(cmap='viridis')
            plt.title(f'nearest_Regridded {variable_name} Data at 0.01° Resolution')
            plt.xlabel('Longitude')
            plt.ylabel('Latitude')
            plot_path = os.path.join(output_dir, f"nearest_regridded_{f.replace('.nc', '.png')}")
            plt.savefig(plot_path, dpi=600)
            plt.show()
            plt.close()
            print(f"Plot saved to {plot_path}")

Result:

Below are the regridded outputs of the dataset above, using SciPy and xarray:

1. Regridded using SciPy

2. Regridded using xarray

Comparison:

xarray is:

- Easy to use with labeled dimensions, making it beginner-friendly.

- Designed to retain metadata, preserving attributes like variable names and units.

- Optimized for large multidimensional datasets, ensuring efficient computation.

SciPy is:

- Flexible and supports irregular grids, making it suitable for non-standard use cases.

- Packed with multiple interpolation methods for versatile applications.

- Computationally slower for large datasets and does not retain metadata, requiring manual management of attributes.

The major advantage of xarray over SciPy is its ability to retain metadata during interpolation. Here, metadata refers to auxiliary information that describes a dataset's structure and attributes, such as coordinate labels, variable names, and units.
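The difference in metadata handling can be seen in a small sketch. The DataArray below is synthetic, and its units and long_name attributes are assumptions chosen to resemble a temperature variable.

```python
import numpy as np
import xarray as xr
from scipy.interpolate import griddata

# Synthetic labeled array with some attributes attached
lats = np.linspace(10.0, 12.0, 5)
lons = np.linspace(70.0, 72.0, 5)
da = xr.DataArray(
    np.random.default_rng(0).normal(300.0, 2.0, size=(5, 5)),
    dims=("latitude", "longitude"),
    coords={"latitude": lats, "longitude": lons},
    name="t2m",
    attrs={"units": "K", "long_name": "2 metre temperature"},
)
new_lats = np.linspace(10.0, 12.0, 9)
new_lons = np.linspace(70.0, 72.0, 9)

# xarray: the result is still a named, labeled DataArray
xr_result = da.interp(latitude=new_lats, longitude=new_lons)
print(xr_result.name, xr_result.attrs)

# SciPy: the result is a bare numpy array with no labels or attributes
lon2d, lat2d = np.meshgrid(lons, lats)
points = np.column_stack([lon2d.ravel(), lat2d.ravel()])
new_lon2d, new_lat2d = np.meshgrid(new_lons, new_lats)
sp_result = griddata(points, da.values.ravel(), (new_lon2d, new_lat2d), method="linear")
print(type(sp_result))

# The metadata has to be re-attached by hand
sp_da = xr.DataArray(
    sp_result,
    dims=("latitude", "longitude"),
    coords={"latitude": new_lats, "longitude": new_lons},
    name=da.name,
    attrs=da.attrs,
)
```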

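You will also notice that regridding with SciPy is slower than regridding with xarray. This is expected: griddata treats the input as scattered points and has to triangulate them first for the linear and cubic methods, while xarray's interp() takes advantage of the regular latitude/longitude grid.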
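If you want to check the speed difference on your own machine, a rough timing sketch could look like this. The grid sizes and the synthetic field are assumptions, and the timings will vary with hardware and library versions, so no expected numbers are given.

```python
import time

import numpy as np
import xarray as xr
from scipy.interpolate import griddata

# Synthetic source grid (41x41) and finer target grid (401x401)
lats = np.linspace(10.0, 20.0, 41)
lons = np.linspace(70.0, 80.0, 41)
field = 300.0 + np.sin(np.deg2rad(lats))[:, None] * np.cos(np.deg2rad(lons))[None, :]
da = xr.DataArray(
    field,
    dims=("latitude", "longitude"),
    coords={"latitude": lats, "longitude": lons},
)
new_lats = np.linspace(10.0, 20.0, 401)
new_lons = np.linspace(70.0, 80.0, 401)

# Time xarray's interp() on the regular grid
start = time.perf_counter()
xr_out = da.interp(latitude=new_lats, longitude=new_lons, method="linear")
xr_seconds = time.perf_counter() - start

# Time SciPy's griddata on the same data, flattened to scattered points
lon2d, lat2d = np.meshgrid(lons, lats)
points = np.column_stack([lon2d.ravel(), lat2d.ravel()])
new_lon2d, new_lat2d = np.meshgrid(new_lons, new_lats)
start = time.perf_counter()
sp_out = griddata(points, field.ravel(), (new_lon2d, new_lat2d), method="linear")
sp_seconds = time.perf_counter() - start

print(f"xarray interp : {xr_seconds:.3f} s")
print(f"SciPy griddata: {sp_seconds:.3f} s")
```

A single run like this is only indicative; repeating the measurement several times gives a more reliable comparison.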

Conclusion

Both SciPy and xarray are excellent tools for regridding, but the choice depends on your requirements:

- Use SciPy for irregular grids or custom workflows.

- Use xarray for simplicity and labeled data with metadata preservation.
