Environmental Big Data and Xarray

Background

Learning Goals

After this unit you should be able to

explain what big data is
define the 3 Vs of big data
relate environmental data to big data
define gridded data
explain the characteristics of NetCDF files
describe how xarray as a tool can be used to work with environmental big data

Environmental Big Data

With the growth of satellite technology, environmental models, and computational capacity, environmental observation has entered the stage of environmental big data (Figure 1):

Figure 1: (a) The volume of data produced has grown exponentially and is expected to soon reach 40 zettabytes (40 trillion gigabytes). Such data generation is only possible due to the concurrent growth in data storage and computing speed, which has moved from the floppy disk (~1 calculation per second per 1000 USD) to cloud-based storage (>10^15 calculations) in the last 30 years. (b) Despite this exponential growth in technological capacity and increasing environmental applications, our planet is still facing serious environmental declines. All environmental declines shown are sourced from prior studies and are indexed relative to their state in the first year plotted (i.e., dividing by the first value in each time series), with the exception of Antarctic ice sheet mass change, which was indexed against expected (BAU) loss by 2100 (81 cm sea-level rise equivalent). Tidal flats represent the overall decline across the globe for the time period and do not show annual fluctuations. Intact Forest Landscapes and tree cover loss do not take gains into account. Note that the index on the y-axis is only shown for the range 0.9–1. ƗData from Global Fishing Watch. *Based on the global human footprint. ^For at least 1 month a year over the period 1996–2005. Credit: Nature Communications

Companies like Google have realized that data at this scale is useful not only for environmental problem solving, but also for many other applications.

To give one example, Google has curated more than 90 PB of Earth data products that have been integrated into Google Earth Engine, a data analysis platform built on Google Cloud infrastructure (Figure 2).

Figure 2: Google Earth Engine. Google Earth Engine combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities. Scientists, researchers, and developers use Earth Engine to detect changes, map trends, and quantify differences on the Earth's surface. Earth Engine is now available for commercial use, and remains free for academic and research use.

What is special about big data?

Definitions of big data vary slightly, but they are usually framed in terms of volume, velocity, and variety. These characteristics are often referred to as the “3 Vs of big data” (Google Cloud).

Volume: This one is quite self-explanatory. We are now dealing with an incredible volume of data that is being collected from a large range of sources. Dealing with the sheer amount requires infrastructure for storing and processing all this data.
Velocity: This is the speed at which data is being generated. Many environmental measurements are automated and collected in real time (or close to it).
Variety: Environmental data comes from many different sources and in many different formats: automated sensors, satellites, environmental models …

Pipelines and format

Because big data has both volume and velocity, it has to be handled in particular ways, and dedicated data formats and processing pipelines exist for this purpose.

Datasets are distributed across chunks that are stored in multiple files that reside on cloud servers and can be accessed by many computers at the same time.
Because the entire dataset is too big to fit into the working memory (RAM) of a computer, datasets are indexed rather than read into RAM. For example, when opening a dataset with xarray, xarray does not read the data itself, but only loads information about what is stored (dimensions, variable names, attributes). Only when a computation is executed to produce a result is the necessary data read into memory. This principle is called lazy execution. It also requires the processing software to interact efficiently with chunked datasets.
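The lazy behaviour described above can be sketched with a small synthetic file. This is a minimal illustration, not a recipe for real big data: the file name, the variable name `temperature`, and the grid are made up, and it assumes that xarray and a NetCDF backend (for example netCDF4 or scipy) are installed.

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

# Build a small synthetic dataset as a stand-in for a much larger file.
# All names and values here are hypothetical.
rng = np.random.default_rng(seed=0)
synthetic = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), rng.normal(15.0, 5.0, size=(24, 10, 20)))},
    coords={
        "time": np.arange(24),
        "lat": np.linspace(-45.0, 45.0, 10),
        "lon": np.linspace(0.0, 342.0, 20),
    },
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "synthetic.nc"
    synthetic.to_netcdf(path)

    # Opening the file reads the metadata (dimensions, names, attributes),
    # but the variable values stay on disk for now.
    with xr.open_dataset(path) as ds:
        lazy_dims = dict(ds.sizes)

        # Selecting one time step and reducing it triggers the actual read,
        # and only for the values needed by this computation.
        subset_mean = float(ds["temperature"].isel(time=0).mean())

print(lazy_dims)
print(subset_mean)
```

For datasets that are split into chunks, xarray can additionally hand the work to dask (via the `chunks` argument of `open_dataset`), which is not shown here because it needs an extra dependency.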

NetCDF

A common data format for environmental data is the NetCDF format (*.nc) developed by Unidata.

Data in netCDF format is:

Self-Describing. A netCDF file includes information about the data it contains.

Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.

Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.

Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.

Sharable. One writer and multiple readers may simultaneously access the same netCDF file. (Unidata)
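The "self-describing" property can be illustrated with a short sketch: metadata written alongside the data is available again when the file is reopened. The variable name `t2m`, its attributes, and the file name are hypothetical, and the example assumes xarray plus a NetCDF backend (netCDF4 or scipy) are installed.

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

# A tiny dataset with descriptive metadata attached (all values are made up).
ds = xr.Dataset(
    {
        "t2m": (
            ("lat", "lon"),
            np.full((2, 3), 288.15),
            {"units": "K", "long_name": "2 metre temperature"},
        )
    },
    coords={"lat": [0.0, 1.0], "lon": [10.0, 11.0, 12.0]},
    attrs={"title": "Hypothetical example file"},
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.nc"
    ds.to_netcdf(path)

    # The reopened file carries its own description: global attributes,
    # variable attributes such as units, and the dimension names.
    with xr.open_dataset(path) as reopened:
        title = reopened.attrs["title"]
        units = reopened["t2m"].attrs["units"]
        dims = reopened["t2m"].dims

print(title, units, dims)
```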

Another format, Zarr, is becoming increasingly popular and has become a de facto standard for storing chunked datasets in cloud object storage, for example on AWS or Microsoft Azure.

Xarray

Xarray is a Python library designed to work with multi-dimensional datasets. Its syntax is similar to that of pandas, and it includes many operations such as aggregation and plotting (Figure 3).

Datasets that are read into xarray have dimensions and variables that can be used to interact with the data. This makes it ideal to work with Earth data such as gridded datasets.
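As a minimal sketch of what "dimensions and variables" means in practice, the toy dataset below holds two variables on a shared grid and uses dimension names for selection and aggregation. The variable names, coordinates, and values are invented for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset: two variables on a shared (time, lat, lon) grid.
time = np.arange(3)                      # three time steps (plain integers)
lat = np.array([-10.0, 0.0, 10.0])
lon = np.array([100.0, 110.0])

temperature = np.arange(18, dtype=float).reshape(3, 3, 2)
precipitation = np.ones((3, 3, 2))

ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat", "lon"), temperature),
        "precipitation": (("time", "lat", "lon"), precipitation),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

# Label-based selection by coordinate value, similar in spirit to pandas .loc.
point = ds["temperature"].sel(lat=0.0, lon=110.0)

# Aggregations refer to dimensions by name instead of by axis number.
time_mean = ds["temperature"].mean(dim="time")
total_precip = ds["precipitation"].sum(dim=["lat", "lon"])

print(point.values)
print(time_mean.dims)
print(total_precip.values)
```

Referring to dimensions by name avoids having to remember which axis number corresponds to time, latitude, or longitude.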

Gridded data products

Earth data is often organized as a gridded data product, with latitude and longitude as dimensions. Adding multiple times or levels (for example height or depth in atmospheric or oceanic data) can lead to three- or four-dimensional datasets.
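A four-dimensional gridded field can be sketched as follows. The 10-degree grid spacing, the pressure levels, and the query location are arbitrary choices for illustration and are not taken from any particular data product.

```python
import numpy as np
import xarray as xr

# Hypothetical coarse global grid with 10-degree spacing.
lat = np.arange(-90.0, 90.0 + 10.0, 10.0)    # 19 latitudes, poles included
lon = np.arange(0.0, 360.0, 10.0)            # 36 longitudes
time = np.arange(4)                          # four time steps
level = np.array([1000.0, 850.0, 500.0])     # illustrative pressure levels (hPa)

# One value per (time, level, lat, lon) grid cell; zeros as placeholder data.
data = np.zeros((time.size, level.size, lat.size, lon.size))
field = xr.DataArray(
    data,
    coords={"time": time, "level": level, "lat": lat, "lon": lon},
    dims=("time", "level", "lat", "lon"),
    name="temperature",
)

# Pick the grid cell closest to an arbitrary location. The horizontal
# dimensions drop out and a (time, level) profile remains.
nearest = field.sel(lat=47.3, lon=8.5, method="nearest")

print(field.shape)
print(float(nearest["lat"]), float(nearest["lon"]))
```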

Weather data

Our first xarray example will be using weather data from a so-called reanalysis dataset.

Reanalysis uses past observations of weather together with short-range weather forecasting models to create a globally complete (even where we don’t have any stations) picture of past weather and climate (Figure 4):

Figure 4: A schematic of the reanalysis process. Credit: [ECMWF](https://www.ecmwf.int/en/about/media-centre/focus/2023/fact-sheet-reanalysis)

The ECMWF Reanalysis v5 (ERA5) is currently one of the most advanced and highest-resolution global reanalysis products. It covers the period from 1940 to the present at a resolution of 0.25 x 0.25 degrees with hourly output.
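To see why a product at this resolution counts as big data, a rough back-of-the-envelope estimate helps. The calculation below assumes a regular latitude/longitude grid that includes both poles, uncompressed 32-bit floating-point values, and a non-leap year. These are simplifying assumptions, not a description of how the data is actually stored or distributed.

```python
# Grid spacing stated in the text.
resolution_deg = 0.25

# Assumption: regular lat/lon grid including both poles, so latitude has
# one more point than 180 / resolution.
n_lat = round(180 / resolution_deg) + 1
n_lon = round(360 / resolution_deg)
points_per_field = n_lat * n_lon

hours_per_year = 24 * 365    # hourly output, non-leap year
bytes_per_value = 4          # assumption: uncompressed 32-bit floats

bytes_per_year = points_per_field * hours_per_year * bytes_per_value
gigabytes_per_year = bytes_per_year / 1e9

print(f"{n_lat} x {n_lon} = {points_per_field:,} grid points per field")
print(f"~{gigabytes_per_year:.1f} GB per variable per year")
```

Under these assumptions a single surface variable already takes up tens of gigabytes per year, before multiplying by the number of variables, vertical levels, and the years since 1940.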

ERA 5 Overview
There are also gridded data products that rely on statistical models to extrapolate observations to a grid. These tend to only work for areas with a high density of observation stations such as the US or Europe. Examples of these are:
- DAYMET at 1 km resolution produced by Oak Ridge National Laboratory
- PRISM Weather Data at 4 km resolution produced by Oregon State University
- GRIDMET at 4 km resolution produced by the Climatology Lab