More Xarray: Understanding Climatology Through Precipitation Data

This tutorial is adapted from the excellent materials of the Climate Match Academy that provides a comprehensive introduction to computational tools in climate science.

Tutorial Objectives

In this tutorial, you will explore the concept of a climatology, and learn how to leverage it using satellite precipitation data. Typically you would want your data to span at least 30 years to calculate a climatology to account for random year-to-year variability. Here you will use data spanning several decades to explore the seasonal cycle of precipitation at a specific location.

Climate and Conceptional Skills

Upon completing this tutorial, you’ll be able to:

Comprehend the fundamentals of climatologies.
Compute a climatology utilizing long-term satellite precipitation data.
Create informative maps including features such as projections, coastlines, and other advanced plotting components.
Calculate an anomaly to a climatology.
Calculate the rolling mean of the anomaly data to smooth the time series and extract long-term signals/patterns.

Throughout this tutorial, you’ll employ NOAA’s monthly precipitation climate data records as the primary resource to demonstrate the process of calculating a long-term climatology for climate analysis. Specifically, you’ll use the Global Precipitation Climatology Project (GPCP) Monthly Precipitation Climate Data Record (CDR). As part of your investigation, you’ll focus on a specific location, observing its data across the entire time duration covered by the GPCP monthly dataset.

Technical Skills

This tutorial is also intended to further practice data analysis with python and Xarray.

Upon completing this tutorial, you’ll be able to:

Retrieve files using pooch to ensure reproducibility
Access climate data through an API from AWS
Load and subset data from netCDF files using Xarray.
Practice making and refining maps using Xarrays plotting functions.
Apply Xarray grouping operations including .groupby(), .rolling()

Importing libraries

For this tutorial we need a number of libraries. You have already encountered many of these before. Additionally, there are a couple of new libraries.

s3fs, boto3, and botocore are libraries that help python interface with Amazon AWS, where the climate data that we will be working with is stored.

pooch: Pooch ¹ is a Python library that can manage data by downloading files from a server (only when needed) and storing them locally in a data cache (a folder on your computer). This allows you to integrate the download of your data into your data pipelines making them more reproducible. pooch also creates a unique signature (a hash) for each downloaded file. If a file changes the hash will be different, so that you know when the underlying data is changed.

# system libraries that we will use
import os
import tempfile
# you already know these
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
# new libraries 
import pooch 
import s3fs
import boto3
import botocore

Section 1: Obtain Monthly Precipitation Data

In this tutorial, the objective is to demonstrate how to calculate the long-term precipitation climatology using monthly precipitation climate data records from NOAA.

You’ll be utilizing the Global Precipitation Climatology Project (GPCP) Monthly Precipitation Climate Data Record (CDR). This dataset contains monthly satellite-gauge data and corresponding precipitation error estimates from January 1979 to the present, gridded at a 2.5°×2.5° resolution. Satellite-gauge means that the climate data record (CDR) is a compilation of precipitation data from multiple satellites and in-situ sources, combined into a final product that optimizes the advantages of each type of data.

While a higher spatial resolution (1°×1°) at daily resolution exists for varied applications, we will restrict ourselves to the coarser resolution monthly data due to computational limitations. However, you are encouraged to delve into the daily higher resolution data for your specific project needs.

Section 1.1: Access GPCP Monthly CDR Data on AWS

All the National Atmospheric and Oceanic Administration Climate Data Record (NOAA-CDR) datasets are available both at NOAA National Centers for Environmental Information (NCEI) and commercial cloud platforms. Here, we are accessing the data directly via the Amazon Web Service (AWS). You can get more information about the NOAA CDRs on AWS’s Open Data Registry.

To perform analysis, we will need to access the monthly data files from AWS first.

Using s3fs, we can browse the AWS filesystem (also called S3 buckets).

# connect to the AWS S3 bucket for the GPCP Monthly Precipitation CDR data
fs = s3fs.S3FileSystem(anon=True)

# get the list of all data files in the AWS S3 bucket fit the data file pattern.
file_pattern = "noaa-cdr-precip-gpcp-monthly-pds/data/*/gpcp_v02r03_monthly_*.nc"
file_location = fs.glob(file_pattern)
file_location[0:3]

We can see that files are organized in a clear pattern. Each data product is organized in a folder (noaa-cdr-precip-gpcp-monthly-pds).

Files are then organized by year into subfolders.

The filenames themselves also have a pattern:

Product Name: gpcp (Global Precipitation Climatology Product)

Version: v02r03 (Version 2, Revision 3)

Temporal resolution: monthly

Date: d<yyyymm>

Processing Time: c20170616 (This will change for each file based on when the file was processed)

File format: .nc (netCDF-4 format)

In other words, if we are looking for the data of a specific day, we can easily locate where the file might be.

For more on data on AWS see here: https://comptools.climatematch.io/tutorials/W1D3_RemoteSensing/student/W1D3_Tutorial2.html

print("Total number of GPCP Monthly precipitation data files:")
print(len(file_location))

We have more than 500 GPCP monthly precipitation CDR data files in the AWS S3 bucket. Each data file contains the data of each month globally starting from January 1979. Now, let’s open a single data file to look at the data structure before we open all data files.

# first, open a client connection
client = boto3.client(
    "s3", config=botocore.client.Config(signature_version=botocore.UNSIGNED)
)  # initialize aws s3 bucket client

# read single data file to understand the file structure
ds_single = xr.open_dataset(pooch.retrieve('http://s3.amazonaws.com/'+file_location[0],known_hash=None )) # open the file
ds_single.data_vars

From the information provided by xarray, there are a total of five data variables in this monthly data file, including precip for the monthly precipitation and precip_error for the monthly precipitation error.

ds_single.coords

All data is organized in three dimensions: latitude, longitude, and time. We want to create a three-dimensional data array for the monthly precipitation data across the entire data period (from January 1979 until present) so we must open all the available files

file_ob = [pooch.retrieve('http://s3.amazonaws.com/'+file,known_hash=None ) for file in file_location]

# using this function instead of 'open_dataset' will concatenate the data along the dimension we specify
ds = xr.open_mfdataset(file_ob, combine="nested", concat_dim="time")

# comment for colab users only: this could toss an error message for you.
# you should still be able to use the dataset with this error just not print ds
# you can try uncommenting the following line to avoid the error
# ds.attrs['history']='' # the history attribute have unique chars that cause a crash on Google colab.
ds

# I also downloaded the data and created a single nc file that is placed in the Data folder. We can use this if pooch takes too long. 
# ds = xr.open_mfdataset('../Data/noaa_gpcp.nc')

In the above code, we used combine='nested', concat_dim='time' to combine all monthly precipitation data into one data array along the dimension of time. This command is very useful when reading in multiple data files of the same structure but covering different parts of the full data record.

Since we are interested in the precipitation data globally at this moment, let’s extract the entire data array of precipitation from the entire dataset.

# examine the precipitation data variable
precip = ds.precip
precip

As you can see, the data array has the dimensions of time longitude latitude. Before delving into further analysis, let’s visualize the precipitation data to gain a better understanding of its patterns and characteristics.

Section 1.2: Visualize GPCP Data Using Cartopy

In previous tutorials, we’ve learned how to make simple visualization using matplotlib using latitude and longitude as the y-axis and x-axis.

# create simple map of the GPCP precipitation data using matplotlib
fig, ax = plt.subplots()

# use the first month of data as an example
precip.sel(time="1979-01-01").plot(ax=ax)

From the figure, the boundary between land and ocean, especially for North and South America, can be observed vaguely. However, this visualization is not ideal as it requires some guesswork in identifying the specific regions. To overcome this limitation and enhance the visualization, we will employ cartopy, a library that offers advanced mapping features. With cartopy, we can incorporate additional elements such as coastlines, major grid markings, and specific map projections.

# visualize the precipitation data of a selected month using cartopy

# select data for the month of interest
data = precip.sel(time="1979-01-01", method="nearest")

# initiate plot, note that it is possible to change the size of the figure,
# which is already defined within the "cma.mplstyle" file in the 'Figure Settings' cell above
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.Robinson()},
                       #figsize=(9, 6)
                      )

# add coastlines to indicate land/ocean
ax.coastlines()

# add major grid lines for latitude and longitude
ax.gridlines()

# add the precipitation data with map projection transformation
# also specify the maximum 'vmax' and minimum value 'vmin' or set 'robust=True' to increase the
# contrast in the map.
data.plot(
    ax=ax,
    transform=ccrs.PlateCarree(),
    #vmin=0,
    #vmax=20,
    robust=True,
    cbar_kwargs=dict(shrink=0.5, label="GPCP Monthly Precipitation \n(mm/day)"), # this controls the colorbar appearance
)

The updated map provides significant improvements, offering us a wealth of information to enhance our understanding of the GPCP monthly precipitation data. From the visualization, we can observe that regions such as the Amazon rainforest, the northern part of Australia, and other tropical areas exhibited higher levels of monthly precipitation in January 1979. These patterns align with our basic geographical knowledge, reinforcing the validity of the data and representation.

Coding Exercises 1.2

Remember the GPCP also offers a data variable that documents the error of the monthly precipitation data used above. This error information is valuable for understanding the level of confidence we can place on the data.

Generate the precipitation error for the same month (1979-01-01) using the examples provided above.

# select data for the month of interest
data = ...

# initiate plot
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.Robinson()})

# add coastal lines to indicate land/ocean
_ = ...

# add grid lines for latitude and longitude
_ = ...

# add the precipitation data
_ = ...

Click for solution

Questions 1.2: Climate Connection

Comment on the spatial pattern of the precipitation error provided by GPCP CDR data for this specific month.
Which areas have the highest errors? Why do you think this might be?

Section 2: Climatology

Section 2.1: Plot Time Series of Data at a Specific Location

We have over 40 years of monthly precipitation data. Let’s examine a specific location throughout the entire time span covered by the GPCP monthly data. For this purpose, we will focus on the data point located at (0°N, 0°E).

# select the entire time series for the grid that contains the location of (0N, 0E)
grid = ds.precip.sel(latitude=0, longitude=0, method="nearest")

# initiate plot
fig, ax = plt.subplots()

# plot the data
grid.plot(ax=ax)

# remove the automatically generated title
ax.set_title("")

# add a grid
ax.grid(True)

# edit default axis labels
ax.set_xlabel("Time (months)")
ax.set_ylabel("GPCP Monthly Precipitation \n(mm/day)")

From the time series plot, note a repeating pattern with a seasonal cycle (roughly the same ups and downs over the course of a year, for each year). In previous tutorials during the climate system overview, you learned how to calculate a climatology. We can apply this same calculation to the precipitation CDR data to investigate the annual cycle of precipitation at this location.

Section 2.2: Calculate the Climatology

As a refresher, a climatology typically employs a 30-year time period for the calculation. In this case, let’s use the reference period of 1981-2010.

# first, let's extract the data for the time period that we want (1981-2010)
precip_30yr = ds.precip.sel(time=slice("1981-01-01", "2010-12-30"))
precip_30yr

Now we can use Xarray’s .groupby() functionality to calculate the monthly climatology.

Recall that .groupby() splits the data based on a specific criterion (in this case, the month of the year) and then applies a process (in our case, calculating the mean value across 30 years for that specific month) to each group before recombining the data together.

# use .groupby() to calculate monthly climatology (1981-2010)
precip_clim = precip_30yr.groupby("time.month").mean(dim="time")
precip_clim

With the resulting climatology data array, we can make a set of maps to visualize the monthly climatology from four different seasons.

# define the figure and each axis for the 2 rows and 2 columns
fig, axs = plt.subplots(
    nrows=2, ncols=2, subplot_kw={"projection": ccrs.Robinson()}, figsize=(12, 8)
)

# axs is a 2-dimensional array of `GeoAxes`.  We will flatten it into a 1-D array to loop over it
axs = axs.flatten()

# loop over selected months (Jan, Apr, Jul, Oct)
for i, month in enumerate([1, 4, 7, 10]):

    # Draw the coastlines and major gridline for each subplot
    axs[i].coastlines()
    axs[i].gridlines()

    # Draw the precipitation data
    precip_clim.sel(month=month).plot(
        ax=axs[i],
        transform=ccrs.PlateCarree(),
        vmin=0,
        vmax=15,  # use the same range of max and min value
        cbar_kwargs=dict(shrink=0.5, label="GPCP Climatology\n(mm/day)"),
    )

In the seasonal collection of the climatology map, we can observe a clear pattern of precipitation across the globe. The tropics exhibit a higher amount of precipitation compared to other regions. Additionally, the map illustrates the seasonal patterns of precipitation changes across different regions of the globe, including areas influenced by monsoons.

Now let’s examine the climatology of the location we previously analyzed throughout the entire time series, specifically at (0°N, 0°E).

# initiate plot
fig, ax = plt.subplots()

# select location and add to plot
precip_clim.sel(latitude=0, longitude=0, method="nearest").plot(ax=ax)

# remove the automatically generated title
ax.set_title("")
# add a grid
ax.grid(True)
# edit default axis labels
ax.set_xlabel("Time (months)")
ax.set_ylabel("GPCP Monthly Precipitation \n(mm/day)")

Text(0, 0.5, 'GPCP Monthly Precipitation \n(mm/day)')

Coding Exercises 2.2

As climate changes, the climatology of precipitation may also change. In fact, climate researchers recalculate a climatology every 10 years. This allows them to monitor how the norms of our climate system change. In this exercise, you will visualize how the climatology of our dataset changes depending on the reference period used.

Calculate the climatology for a different reference period (1991-2020) and compare it to the climatology that we just generated with the reference period (1981-2010). Be sure to compare the two and note the differences. Can you see why it is important to re-calculate this climatology?

# extract 30 year data for 1991-2020
precip_30yr_exercise = ...

# calculate climatology for 1991-2020
precip_clim_exercise = ...

# find difference in climatologies: (1981-2010) minus (1991-2020)
precip_diff_exercise = ...

# Compare the climatology for four different seasons by generating the
#         difference maps for January, April, July, and October with colorbar max and min = 1,-1

# Define the figure and each axis for the 2 rows and 2 columns
fig, axs = plt.subplots(
    nrows=2, ncols=2, subplot_kw={"projection": ccrs.Robinson()}#, figsize=(12, 8)
)

# axs is a 2-dimensional array of `GeoAxes`. We will flatten it into a 1-D array
axs = ...

# Loop over selected months (Jan, Apr, Jul, Oct)
for i, month in enumerate([1, 4, 7, 10]):
    ...

  Cell In[81], line 27
    )
    ^
SyntaxError: unmatched ')'

Click for solution

Summary for calculating climatologies

Climatologies provide valuable insight into the typical weather patterns of a region. Key takeaways from this tutorial include:

A climatology pertains to the long-term average of various system attributes, such as temperature and precipitation, often spanning 30 years.
Satellite climate data records offer valuable insights for calculating climatologies on a global scale.

By comparing the weather conditions of a specific day or month to the climatology, we can determine the extent to which they deviate from the norm. This concept of comparing against the climatology, or the norm, will be the focus of our next tutorial - the anomaly!

Section 3: From Climatology to Anomaly

Building upon your knowledge of climatology from the last tutorial, Tutorial 4, you will now calculate the anomalies from this climatology. An anomaly, in the context of climate studies, represents a departure from standard climatological conditions. For example, if the normal January temperature of the city that you live in is 10°C and the January temperature of this year is 15°C. We usually say the temperature anomaly of January this year is 5°C above normal/ the climatology. The anomaly is an essential tool in detecting changes in climate patterns and is frequently utilized in critical climate reports such as those generated by the Intergovernmental Panel on Climate Change (IPCC).

To calculate an anomaly, we first establish a reference period, usually a 30-year window, to define our climatology. In this process, it is crucial to use high-quality data and aggregate it to the desired spatial resolution and temporal frequency, such as weekly or monthly. The anomaly is then determined by subtracting this long-term average from a given observation, thus creating a useful metric for further climate analysis such as trend analysis.

In this tutorial, we will employ the GPCP monthly precipitation Climate Data Record (CDR) to compute a monthly anomaly time series. Furthermore, we will learn to calculate the rolling mean of the generated precipitation anomaly time series. This knowledge will be invaluable for our upcoming tutorial on climate variability.

Section 3.1: Calculating the Monthly Anomaly

To calculate the anomaly, you first need to calculate the monthly climatology. We have already done this above using .sel() and .groupby().

Just to make sure, we repeat this here. Have a look at the code below to make sure that you understand each part of the command.

# calculate climatology using `.sel()` and `.groupby()` directly.
precip_clim = (
    ds.precip.sel(time=slice("1981-01-01", "2010-12-01"))
    .groupby("time.month")
    .mean(dim="time")
)
precip_clim

Now we have the monthly precipitation climatology. How can we calculate the monthly anomaly?

As we learned before - let’s use .groupby() from xarray. We can split the entire time period based on the month of the year and then subtract the climatology of that specific month from the monthly value and recombine the value together.

# use `.groupby()` to calculate the monthly anomaly
precip_anom = ds.precip.groupby("time.month") - precip_clim
precip_anom

You may have noticed that there is an additional coordinate in the anomaly dataset. The additional coordinate is month which is a direct outcome because of the .groupby() action we just performed.

If you want to save the data for future use, you can write the data out to a netCDF file using .to_netcdf(). It will automatically carry all the coordinates, dimensions, and relevant information into the netCDF file.

# an example of how to export the GPCP monthly anomaly data comparing to the climatology period of 1981-2010.
# precip_anom.to_netcdf('../Data/gpcp-monthly-anomaly_1981-2010.nc')

Section 3.2: Examining the Anomaly

First, let’s take a look at the geospatial pattern of the monthly anomaly of a selected month – January of 1979.

# initate plot
fig = plt.figure()

# set map projection
ax = plt.axes(projection=ccrs.Robinson())

# add coastal lines to indicate land/ocean
ax.coastlines()

# add grid lines for latitude and longitude
ax.gridlines()

# draw the precipitation data
precip_anom.sel(time="1979-01-01").plot(
    ax=ax,
    transform=ccrs.PlateCarree(),
    vmin=-8,
    vmax=8,
    cmap="BrBG",
    cbar_kwargs=dict(shrink=0.5, label="Monthly anomaly \n(mm/day)"),
)

To visualize the changes in the precipitation anomaly over many months, we can also take a look at the time series of a selected grid. We will use the same point (0°N, 0°E) that we used as an example in part 2.

# set up two subplots that share the x-axis to compare monthly precipitation and monthly anomaly
fig, axs = plt.subplots(2, sharex=True)

# draw monthly precipitation
axs[0].plot(ds.time, ds.precip.sel(latitude=0, longitude=0, method="nearest"))

# set title and y-label
fig.suptitle("GPCP Monthly Precipitaion v.s. Monthly Anomaly")
axs[0].set_ylabel("Monthly precipitation\n(mm/day)")

# draw anomaly
axs[1].plot(
    precip_anom.time, precip_anom.sel(latitude=0, longitude=0, method="nearest")
)
# aesthetics
axs[1].set_xlabel("Time (months)")
axs[1].set_ylabel("Precipitation anomaly\n(mm/day)")

# add horizontal line of y=0 for the anomaly subplot
axs[1].axhline(y=0, color="k", linestyle="-")
# add grids
_ = [ax.grid(True) for ax in axs]

Note that, unlike the upper panel showing precipitation values, the lower panel displaying the monthly anomaly does not exhibit distinct seasonal cycles. This discrepancy highlights one of the advantages of utilizing anomaly data for climate analysis. By removing the repetitive patterns induced by seasonality or other stable factors, we can effectively isolate the specific signals in the data that are of interest, such as climate variability or climate trends. This approach allows for a clearer focus on the desired climate-related patterns without the interference of predictable seasonal variations.

Section 4: Anomaly Analysis

In this section, we are going to explore a few different analyses of the anomaly data:

Calculating the rolling mean
Calculating the global mean

You have already practiced using these tools during the last two days or material, here we will focus on applying them to a much longer satellite data record than you have encountered previously.

Section 4.1: Rolling Mean

The monthly anomaly time series often contains noisy data that may obscure the patterns associated with large-scale climate variability. To mitigate this noise and enhance the visibility of underlying patterns, we can apply a rolling mean technique using the .rolling() method. This approach involves smoothing the monthly time series to facilitate the identification of climate variability.

# calculate 12-month rolling mean for the selected location, add in .compute() at the end if using Google Colab if it throws an error
grid_month = precip_anom.sel(latitude=0, longitude=0, method="nearest")
grid_rolling = grid_month.rolling(time=12, center=True).mean()

# create the time series plot of monthly anomaly
fig, ax = plt.subplots()
grid_month.plot(label="Monthly anomaly", ax=ax)
grid_rolling.plot(color="k", label="12-mon rolling mean", ax=ax)

# aesthetics
ax.axhline(y=0, color="y", linestyle="-")
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation anomaly (mm/day)")
ax.legend()
ax.set_title("") # remove the automatically generated title
ax.grid(True)

As you can see, the 12-month rolling mean removes the high-frequency variations of monthly precipitation anomaly data, allowing the slower-changing patterns of precipitation to become more apparent.

Coding Exercises 4.1

Calculate the 24-month rolling mean for the same grid and compare the three different time series (monthly anomaly, 12-month rolling mean, 24-month rolling mean).

# calculate 24-month rolling mean
grid_rolling_24m = ...

# plot all three time series together with different colors
fig, ax = plt.subplots()
_ = grid_month.plot(label="Monthly anomaly", ax=ax)
_ = ...
_ = ...

# aesthetics
ax.axhline(y=0, color="y", linestyle="-")
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation anomaly (mm/day)")
ax.legend()
ax.set_title("") # remove the automatically generated title
ax.grid(True)

Click for solution

Section 4.2: Global Mean

When examining global-scale changes, it is common to aggregate global mean values from all grid cells. However, it is important to note that despite each grid having the same resolution of 2.5°×2.5°, they represent different areas on the Earth’s surface. Specifically, the same grid covers larger spatial areas in the tropics compared to the polar regions as discussed in the climate system overview day (W1D1).

To address this issue, it is necessary to weigh the values based on their respective surface areas. Unlike the model data you used previously, where you had the grid cell area available as a variable, for our gridded observations we will use weights based on the cosine of the latitude as this function takes into account the decreasing area towards the poles.

# calculate the weights using the latitude coordinates
weights = np.cos(np.deg2rad(precip_anom.latitude))
weights.name = "weights"
weights

Using the calculated weights, we can use xarray’s .weighted() method to calculate a weighted mean.

# calculate weighted global monthly mean
anom_weighted = precip_anom.weighted(weights)
global_weighted_mean = anom_weighted.mean(("latitude", "longitude"))
global_weighted_mean

# create the time series plot of global weighted monthly anomaly
fig, ax = plt.subplots()
global_weighted_mean.plot(label="Monthly anomaly", ax=ax)
global_weighted_mean.rolling(time=12, center=True).mean(("latitude", "longitude")).plot(
    color="k", label="12-mon rolling mean", ax=ax
)
ax.axhline(y=0, color="y", linestyle="-")
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation anomaly (mm/day)")
ax.legend()
ax.grid(True)

# calculate unweighted global mean
global_unweighted_mean = ...

# calculate the difference between weighted and unweighted global mean
global_diff = ...

# plot the time series of the difference
fig, ax = plt.subplots()
_ = global_weighted_mean.plot(label="Monthly anomaly", ax=ax)
_ = ...

# aesthetics
ax.axhline(y=0, color="y", linestyle="-")
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation anomaly (mm/day)")
ax.legend()
ax.grid(True)

Click for solution

Summary: Anomalies and weighted means

In this tutorial, you learned how to calculate a climate anomaly using satellite-derived precipitation data.

The anomaly allows us to look at the signals that may be covered by the seasonal cycle pattern (e.g., temperature/precipitation seasonal cycle).
The anomaly data can be further smoothed using the rolling mean to reveal longer-term signals at an annual or decade time scale.

We will use the anomaly concept to study climate variability in the next tutorial.

Footnotes

because it fetches your data↩︎