After this unit you should be able to
With the growth of satellite technology, environmental models, and computational capacity, environmental observation has entered the era of environmental big data (Figure 1).
Companies like Google have realized that these amounts of data are useful not only for solving environmental problems, but also for many other applications.
To give one example, Google has curated more than 90 PB of Earth data products and integrated them into Google Earth Engine, a data analysis platform built on Google Cloud infrastructure (Figure 2).
Definitions of big data vary slightly, but they almost always describe it in terms of volume, velocity, and variety. These characteristics are often referred to as the “3 Vs of big data” (Google Cloud).
Because big data arrives in large volumes and at high velocity, it has to be handled in specific ways, and there are data formats and processing pipelines designed to do so.
A common data format for environmental data is netCDF (*.nc), developed by Unidata.
Data in netCDF format is:
- Self-Describing. A netCDF file includes information about the data it contains.
- Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
- Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
- Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
- Sharable. One writer and multiple readers may simultaneously access the same netCDF file. (Unidata)
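The "self-describing" property can be illustrated without writing a file. The sketch below builds a small in-memory xarray dataset that follows the netCDF data model and prints an ncdump-style summary. The variable name `t2m`, the coordinate values, and all attributes are made up for illustration.

```python
import io

import numpy as np
import xarray as xr

# Hypothetical 2 x 3 grid of air temperatures; names and values are made up.
ds = xr.Dataset(
    data_vars={
        "t2m": (
            ("lat", "lon"),
            np.array([[280.0, 281.5, 283.0], [285.0, 286.5, 288.0]]),
            {"units": "K", "long_name": "2 metre air temperature"},
        )
    },
    coords={
        "lat": ("lat", [40.0, 50.0], {"units": "degrees_north"}),
        "lon": ("lon", [0.0, 10.0, 20.0], {"units": "degrees_east"}),
    },
    attrs={"title": "Toy example dataset"},
)

# Write an ncdump-style summary (dimensions, variables, attributes) to a buffer.
buf = io.StringIO()
ds.info(buf=buf)
summary = buf.getvalue()
print(summary)
```

The metadata (units, long names, title) travels with the data, so a reader of the file does not need a separate document to interpret the numbers.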
Zarr, a chunked array storage format, is becoming increasingly popular and is widely treated as the de facto standard for storing such data in cloud object storage, for example on AWS or Microsoft Azure.
Xarray is a Python library designed to work with multi-dimensional datasets. Its syntax is similar to that of pandas, and it includes many operations such as aggregation and plotting (Figure 3).
Datasets that are read into xarray have dimensions and variables that can be used to interact with the data. This makes it ideal to work with Earth data such as gridded datasets.
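As a minimal sketch of what interacting with dimensions and variables looks like, the example below builds a small synthetic dataset (the variable name `temperature` and all values are made up) and then selects and aggregates by dimension name rather than by array position.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data: 4 daily time steps on a 3 x 4 latitude/longitude grid.
time = pd.date_range("2020-01-01", periods=4, freq="D")
lat = [30.0, 40.0, 50.0]
lon = [0.0, 10.0, 20.0, 30.0]
values = np.arange(4 * 3 * 4, dtype="float64").reshape(4, 3, 4)

ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), values)},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Label-based selection: the time series at one grid point.
point = ds["temperature"].sel(lat=40.0, lon=10.0)

# Nearest-neighbour selection for coordinates that are not exactly on the grid.
nearest = ds["temperature"].sel(lat=42.0, lon=12.0, method="nearest")

# Aggregation over a named dimension: the time mean at every grid point.
time_mean = ds["temperature"].mean(dim="time")

print(ds)
print(point.values)
print(time_mean.shape)
```

Because operations refer to `time`, `lat`, and `lon` by name, the code does not depend on the order in which the axes happen to be stored.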
Earth data is often organized as a gridded data product, with latitude and longitude as dimensions. Adding multiple times or levels (for example, height in the atmosphere or depth in the ocean) leads to three- or four-dimensional datasets.
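The following sketch shows how a two-dimensional latitude/longitude field grows into a three- and then four-dimensional array once time and level are added. The grid, the dates, and the pressure levels are arbitrary illustrative choices, and the field itself is filled with zeros.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A coarse, made-up global grid: 5 latitudes x 4 longitudes.
lat = np.linspace(-90.0, 90.0, 5)
lon = np.array([0.0, 90.0, 180.0, 270.0])

field_2d = xr.DataArray(
    np.zeros((5, 4)),
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
    name="temperature",
)

# Adding a time dimension gives a 3D array (time, lat, lon).
time = pd.date_range("2020-01-01", periods=3, freq="D")
field_3d = field_2d.expand_dims(time=time)

# Adding a vertical level dimension gives a 4D array (time, level, lat, lon).
level = [1000, 850, 500]  # illustrative pressure levels in hPa
field_4d = field_3d.expand_dims(level=level, axis=1)

# Selecting one level and one time step returns a 2D map again.
one_map = field_4d.sel(level=850).isel(time=0)

print(field_3d.dims, field_3d.shape)
print(field_4d.dims, field_4d.shape)
print(one_map.dims)
```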

Our first xarray example will be using weather data from a so-called reanalysis dataset.
Reanalysis uses past observations of weather together with short-range weather forecasting models to create a globally complete (even where we don’t have any stations) picture of past weather and climate (Figure 4):
ERA5, the fifth-generation reanalysis produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), is currently among the most advanced and highest-resolution global products. It covers the period from 1940 to the present at a resolution of 0.25 x 0.25 degrees, with hourly output.
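To get a feeling for why such a product counts as big data, the rough calculation below estimates the size of a single surface variable for one year. It assumes a regular latitude/longitude grid that includes both poles, uncompressed 4-byte floating-point values, and a non-leap year; files as actually distributed are usually packed or compressed, so real sizes differ.

```python
# Back-of-the-envelope size of one hourly variable on a 0.25-degree global grid.
resolution_deg = 0.25

n_lat = int(180 / resolution_deg) + 1  # 721 rows, assuming both poles are included
n_lon = int(360 / resolution_deg)      # 1440 columns
hours_per_year = 365 * 24              # 8760 time steps in a non-leap year
bytes_per_value = 4                    # assumption: uncompressed float32

grid_points = n_lat * n_lon
bytes_per_year = grid_points * hours_per_year * bytes_per_value

print(f"grid points per time step: {grid_points:,}")
print(f"one variable, one year:    {bytes_per_year / 1e9:.1f} GB")
```

Under these assumptions, one variable for one year is already about 36 GB, before multiplying by the number of variables, vertical levels, and years.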

There are also gridded data products that rely on statistical models to extrapolate observations to a grid. These tend to only work for areas with a high density of observation stations such as the US or Europe. Examples of these are: