Working with timeseries data

import pandas as pd

Reading timeseries data: Weather station data

In this example, we will use NOAA weather station data that I downloaded from Climate Data Online. We are using data from Charlottesville-Albemarle Airport; for this station, daily summaries are archived.

We read the data like in previous examples. With no options, this is what we get.

datafile = r'../Data/CDO_USW00093736_1955_2025.txt'
df = pd.read_csv(datafile)
df.head()
STATION           ELEVATION  LATITUDE   LONGITUDE  DATE     PRCP     SNWD     SNOW     TAVG     TMAX     TMIN     AWND     WDF2     WDF5     WSF2     WSF5
0 ----------------- ---------- ---------- ------...
1 GHCND:USW00093736      194.8      38.14     -7...
2 GHCND:USW00093736      194.8      38.14     -7...
3 GHCND:USW00093736      194.8      38.14     -7...
4 GHCND:USW00093736      194.8      38.14     -7...

Pandas did not separate the data into different columns because it was expecting a standard csv file with comma-separated values. In our file, however, the values are separated by runs of whitespace. We can tell pandas this with the sep keyword: \s matches a whitespace character and + means one or more.

We also see that the row after the header contains no data, only dashes. We use the skiprows keyword to skip this second row (row 1 when counting from zero). By default, pandas assumes that the first row contains the column headers.

df = pd.read_csv(datafile, sep=r'\s+', skiprows=[1])
df.head()
STATION ELEVATION LATITUDE LONGITUDE DATE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
0 GHCND:USW00093736 194.8 38.14 -78.45 19560101 -9999.0 -9999.0 -9999.0 -9999.0 10.0 -1.7 -9999.0 -9999 -9999 -9999.0 -9999.0
1 GHCND:USW00093736 194.8 38.14 -78.45 19560102 -9999.0 -9999.0 -9999.0 -9999.0 4.4 -3.9 -9999.0 -9999 -9999 -9999.0 -9999.0
2 GHCND:USW00093736 194.8 38.14 -78.45 19560103 -9999.0 -9999.0 -9999.0 -9999.0 12.8 0.6 -9999.0 -9999 -9999 -9999.0 -9999.0
3 GHCND:USW00093736 194.8 38.14 -78.45 19560104 -9999.0 -9999.0 -9999.0 -9999.0 8.9 -2.8 -9999.0 -9999 -9999 -9999.0 -9999.0
4 GHCND:USW00093736 194.8 38.14 -78.45 19560105 -9999.0 -9999.0 -9999.0 -9999.0 11.7 0.0 -9999.0 -9999 -9999 -9999.0 -9999.0

This looks much better. We also see that there are lots of -9999 and -9999.0 values. The CDO documentation tells us that these are used to identify missing data, so we can let pandas know.

df = pd.read_csv(datafile, sep=r'\s+', skiprows=[1], na_values=[-9999, -9999.0])
df.head()
STATION ELEVATION LATITUDE LONGITUDE DATE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
0 GHCND:USW00093736 194.8 38.14 -78.45 19560101 NaN NaN NaN NaN 10.0 -1.7 NaN NaN NaN NaN NaN
1 GHCND:USW00093736 194.8 38.14 -78.45 19560102 NaN NaN NaN NaN 4.4 -3.9 NaN NaN NaN NaN NaN
2 GHCND:USW00093736 194.8 38.14 -78.45 19560103 NaN NaN NaN NaN 12.8 0.6 NaN NaN NaN NaN NaN
3 GHCND:USW00093736 194.8 38.14 -78.45 19560104 NaN NaN NaN NaN 8.9 -2.8 NaN NaN NaN NaN NaN
4 GHCND:USW00093736 194.8 38.14 -78.45 19560105 NaN NaN NaN NaN 11.7 0.0 NaN NaN NaN NaN NaN

Great. The missing data is now represented by NaN.
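A quick way to verify this is to count the missing values per column with .isna().sum(). The sketch below uses a small made-up frame (column names chosen to match, values invented), not the actual CDO file:

```python
import numpy as np
import pandas as pd

# A tiny made-up frame with the same kind of gaps (not the real CDO data)
toy = pd.DataFrame({
    'TMAX': [10.0, 4.4, np.nan],
    'PRCP': [np.nan, np.nan, 1.2],
})

# isna() marks missing entries; summing counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column['TMAX'])  # 1
print(missing_per_column['PRCP'])  # 2
```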

What data types did pandas infer?

df.info()
<class 'pandas.DataFrame'>
RangeIndex: 25554 entries, 0 to 25553
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   STATION    25554 non-null  str    
 1   ELEVATION  25554 non-null  float64
 2   LATITUDE   25554 non-null  float64
 3   LONGITUDE  25554 non-null  float64
 4   DATE       25554 non-null  int64  
 5   PRCP       23343 non-null  float64
 6   SNWD       5958 non-null   float64
 7   SNOW       5970 non-null   float64
 8   TAVG       2831 non-null   float64
 9   TMAX       25498 non-null  float64
 10  TMIN       25496 non-null  float64
 11  AWND       9914 non-null   float64
 12  WDF2       9919 non-null   float64
 13  WDF5       9899 non-null   float64
 14  WSF2       9920 non-null   float64
 15  WSF5       9901 non-null   float64
dtypes: float64(14), int64(1), str(1)
memory usage: 3.5 MB

Do you notice a problem? The DATE column was read as an integer, which means that pandas did not recognize it as a date.

So, January 1st, 1956 became 19560101.
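As an aside, a column that was already read as integers in YYYYMMDD form can also be converted after the fact with pd.to_datetime and an explicit format string. A minimal sketch with made-up values:

```python
import pandas as pd

# Integer dates in YYYYMMDD form, as pandas initially read them
raw = pd.Series([19560101, 19560102])

# format='%Y%m%d' tells pandas how to parse each value
converted = pd.to_datetime(raw, format='%Y%m%d')
print(converted[0])  # 1956-01-01 00:00:00
```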

We need to explicitly tell pandas which columns should be parsed as dates:

df = pd.read_csv(datafile, sep=r'\s+',
                 na_values=[-9999, -9999.0],
                 skiprows=[1],
                 parse_dates=[4])
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 25554 entries, 0 to 25553
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   STATION    25554 non-null  str           
 1   ELEVATION  25554 non-null  float64       
 2   LATITUDE   25554 non-null  float64       
 3   LONGITUDE  25554 non-null  float64       
 4   DATE       25554 non-null  datetime64[us]
 5   PRCP       23343 non-null  float64       
 6   SNWD       5958 non-null   float64       
 7   SNOW       5970 non-null   float64       
 8   TAVG       2831 non-null   float64       
 9   TMAX       25498 non-null  float64       
 10  TMIN       25496 non-null  float64       
 11  AWND       9914 non-null   float64       
 12  WDF2       9919 non-null   float64       
 13  WDF5       9899 non-null   float64       
 14  WSF2       9920 non-null   float64       
 15  WSF5       9901 non-null   float64       
dtypes: datetime64[us](1), float64(14), str(1)
memory usage: 3.5 MB

That worked.

Time indices

Indexes are very powerful, and they are a big part of why pandas is so useful. There are different index types for different kinds of data, and the DatetimeIndex for time series is especially useful.

We now set the date column as the index for our dataframe.

df = df.set_index('DATE')
df.head()
STATION ELEVATION LATITUDE LONGITUDE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
DATE
1956-01-01 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 10.0 -1.7 NaN NaN NaN NaN NaN
1956-01-02 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 4.4 -3.9 NaN NaN NaN NaN NaN
1956-01-03 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 12.8 0.6 NaN NaN NaN NaN NaN
1956-01-04 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 8.9 -2.8 NaN NaN NaN NaN NaN
1956-01-05 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 11.7 0.0 NaN NaN NaN NaN NaN

The DatetimeIndex object has lots of useful attributes. Below are some examples:

df.index.month
Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       ...
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      dtype='int32', name='DATE', length=25554)
df.index.year
Index([1956, 1956, 1956, 1956, 1956, 1956, 1956, 1956, 1956, 1956,
       ...
       2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026],
      dtype='int32', name='DATE', length=25554)
df.index.day_of_week
Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       ...
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4],
      dtype='int32', name='DATE', length=25554)

We can also use the time index to calculate the length of a period.

df.index[10] - df.index[0] 
Timedelta('10 days 00:00:00')
(df.index[10] - df.index[0]).total_seconds() 
864000.0
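Subtracting two index entries yields a Timedelta, which has useful attributes of its own. A small sketch on a synthetic daily index standing in for the station data:

```python
import pandas as pd

# Synthetic daily index (stand-in for the station data above)
idx = pd.date_range('1956-01-01', periods=11, freq='D')

span = idx[10] - idx[0]
print(span.days)             # 10
print(span.total_seconds())  # 864000.0
```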

Accessing data through the time index

We can now access values by time:

df.loc['2017-08-07']
STATION      GHCND:USW00093736
ELEVATION                192.3
LATITUDE               38.1374
LONGITUDE            -78.45513
PRCP                      26.7
SNWD                       NaN
SNOW                       NaN
TAVG                       NaN
TMAX                      25.6
TMIN                      21.1
AWND                       1.5
WDF2                      30.0
WDF5                     320.0
WSF2                       5.8
WSF5                       7.2
Name: 2017-08-07 00:00:00, dtype: object

Or use slicing to get a range:

df.loc['2017-07-01':'2017-07-31']
STATION ELEVATION LATITUDE LONGITUDE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
DATE
2017-07-01 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 34.4 23.3 2.7 190.0 120.0 6.7 8.9
2017-07-02 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.7 20.6 1.5 200.0 210.0 5.8 8.1
2017-07-03 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.9 19.4 0.9 210.0 210.0 5.4 6.3
2017-07-04 GHCND:USW00093736 192.3 38.1374 -78.45513 0.3 NaN NaN NaN 32.2 21.1 0.8 210.0 200.0 4.0 5.4
2017-07-05 GHCND:USW00093736 192.3 38.1374 -78.45513 5.3 NaN NaN NaN 27.8 22.2 2.0 80.0 60.0 6.3 8.1
2017-07-06 GHCND:USW00093736 192.3 38.1374 -78.45513 0.5 NaN NaN NaN 31.7 21.7 1.3 210.0 60.0 5.8 6.7
2017-07-07 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.3 22.8 1.7 200.0 210.0 5.8 8.5
2017-07-08 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 32.8 20.0 1.6 320.0 330.0 5.8 8.1
2017-07-09 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.1 17.2 0.9 160.0 160.0 5.4 6.3
2017-07-10 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.9 17.2 2.0 190.0 190.0 6.7 9.4
2017-07-11 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.1 20.6 2.2 200.0 190.0 5.8 7.2
2017-07-12 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 36.1 22.8 1.6 260.0 260.0 5.4 6.7
2017-07-13 GHCND:USW00093736 192.3 38.1374 -78.45513 0.5 NaN NaN NaN 36.7 22.8 2.0 230.0 210.0 7.2 9.4
2017-07-14 GHCND:USW00093736 192.3 38.1374 -78.45513 22.6 NaN NaN NaN 37.8 23.9 2.3 340.0 320.0 10.3 17.0
2017-07-15 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.3 22.8 1.6 340.0 340.0 4.5 6.7
2017-07-16 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 32.8 20.6 1.5 120.0 140.0 4.0 6.3
2017-07-17 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 32.8 20.6 1.4 170.0 180.0 7.6 9.4
2017-07-18 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 35.0 20.0 1.5 150.0 150.0 7.6 10.3
2017-07-19 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 35.6 18.9 0.9 100.0 350.0 4.0 5.4
2017-07-20 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 36.7 21.7 1.2 190.0 290.0 5.8 8.9
2017-07-21 GHCND:USW00093736 192.3 38.1374 -78.45513 5.8 NaN NaN NaN 36.7 23.3 1.1 320.0 310.0 8.9 14.8
2017-07-22 GHCND:USW00093736 192.3 38.1374 -78.45513 1.3 NaN NaN NaN 36.7 22.8 2.1 50.0 350.0 7.6 10.3
2017-07-23 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 34.4 23.3 1.5 200.0 190.0 9.4 13.0
2017-07-24 GHCND:USW00093736 192.3 38.1374 -78.45513 9.4 NaN NaN NaN 33.3 22.8 2.6 190.0 160.0 11.6 14.3
2017-07-25 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 30.0 18.9 2.7 20.0 40.0 6.7 8.5
2017-07-26 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 30.0 16.7 1.0 200.0 180.0 5.4 7.6
2017-07-27 GHCND:USW00093736 192.3 38.1374 -78.45513 0.5 NaN NaN NaN 31.1 20.0 1.4 210.0 190.0 5.4 6.7
2017-07-28 GHCND:USW00093736 192.3 38.1374 -78.45513 51.1 NaN NaN NaN 30.0 22.2 1.2 210.0 200.0 5.4 6.3
2017-07-29 GHCND:USW00093736 192.3 38.1374 -78.45513 0.3 NaN NaN NaN 26.7 17.2 3.8 40.0 20.0 10.3 12.5
2017-07-30 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 28.9 13.3 1.3 30.0 50.0 5.4 6.7
2017-07-31 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.1 15.0 0.8 70.0 90.0 4.5 5.8
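Partial-string indexing also works: pandas expands a string like '2017-07' to cover the whole month. A sketch on a synthetic daily series (not the station data):

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering one year
idx = pd.date_range('2017-01-01', '2017-12-31', freq='D')
s = pd.Series(np.arange(len(idx)), index=idx)

july = s.loc['2017-07']  # all of July 2017
year = s.loc['2017']     # the whole year
print(len(july), len(year))  # 31 365
```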

Plotting data

Pandas is very “time aware” and will format the axis of our plot accordingly.

df.loc['2017-01-01':'2017-12-31', 'TMAX'].plot()

Aggregating data

This timeseries has daily resolution, and the daily plots can be very noisy.

df.plot(y='TMAX')   

Aggregating our data will help us reduce the noise.

Resampling data

A common operation is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample method. The resampling periods are specified using pandas offset aliases.

Below we resample the dataset to yearly resolution, taking the mean within each year. YS stands for year start. A full list of these so-called offset aliases is found in the pandas timeseries documentation.

yearly_df = df[['TMIN', 'TMAX']].resample('YS').mean()
yearly_df.head()
TMIN TMAX
DATE
1956-01-01 9.924501 19.512821
1957-01-01 10.489779 19.036464
1958-01-01 8.379614 17.766391
1959-01-01 9.181096 19.614247
1960-01-01 8.561749 18.340984
yearly_df.plot()

This looks much less noisy, and we could for example now think about whether there is a trend in our dataset.
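The same pattern works with any offset alias, for example 'MS' for month start. A sketch on a synthetic daily series (values are made up):

```python
import pandas as pd

# Synthetic daily series over three months, 1.0 per day
idx = pd.date_range('2000-01-01', '2000-03-31', freq='D')
s = pd.Series(1.0, index=idx)

# Monthly totals: 'MS' labels each bin by the start of the month
monthly = s.resample('MS').sum()
print(monthly.tolist())  # [31.0, 29.0, 31.0]  (2000 is a leap year)
```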

Groupby

We have already seen how powerful the .groupby() operation is.

A common way to analyze weather data in climate science is to create a “climatology,” which contains the average values in each month or day of the year. We can do this easily with .groupby(). Recall that df.index is a pandas DateTimeIndex object.

monthly_climatology  = df[['TMIN', 'TMAX']].groupby(df.index.month).mean()
monthly_climatology 
TMIN TMAX
DATE
1 -2.707156 7.235597
2 -1.510136 9.175895
3 2.457680 14.229545
4 7.643440 20.404322
5 12.485556 24.526753
6 17.079532 28.774678
7 19.510906 30.701017
8 18.657407 29.771930
9 14.877751 26.253037
10 8.311281 20.599676
11 3.371483 14.821412
12 -0.699630 9.224699

Each row in this new dataframe represents the average value for one month of the year (1=January, 2=February, etc.).

We can apply more customized aggregations, specified with .aggregate(). Below we keep the max of the TMAX column and the min of the TMIN column.

monthly_T_climatology = df.groupby(df.index.month).aggregate({'TMAX': 'max',
                                                              'TMIN': 'min'})
monthly_T_climatology.head()
TMAX TMIN
DATE
1 27.2 -21.1
2 27.8 -19.4
3 32.8 -17.1
4 34.4 -5.6
5 36.1 -1.7
monthly_T_climatology.plot(marker='o')

If we want a finer scale, we can group by day of year:

daily_T_climatology = df.groupby(df.index.dayofyear).aggregate({'TMAX': 'max',
                                                              'TMIN': 'min'})
daily_T_climatology.plot(marker='.')

Missing data and gapfilling

Because timeseries data is ordered and can show periodic variation (e.g. daily, seasonal, annual), dealing with missing data is very important.

Take precipitation data for example.

df['PRCP'].plot(kind='hist',ylabel='Frequency', xlabel='Precipitation (mm)', bins=20, grid = True)

Because rainfall is intermittent and the majority of days are dry, rainfall is typically reported as monthly or annual totals. Let's look at the period between 1980 and 2000.

df_20_years = df.loc['1980':'2000']
df_20_years.groupby(df_20_years.index.year)['PRCP'].sum().plot(kind='bar',xlabel='Year', ylabel='Total Precipitation (mm)', title='Total Precipitation by Year (1980-2000)')   

Using the data as is, it appears that 1988 and 1994 were unusually dry.

We can check the same period for missing data. The .isna() method returns True where data is missing (i.e. NaN). We can also exploit the fact that True counts as 1 and False as 0.

So the code below produces a series with True for missing days and False for days with data, and then sums up the Trues for each year.
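The True-counts-as-1 trick on its own looks like this (toy series, not the station data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

mask = s.isna()
print(mask.tolist())  # [False, True, True, False]
print(mask.sum())     # 2 -- True counts as 1, False as 0
```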

df_20_years['PRCP'].isna().groupby(df_20_years.index.year).sum().plot(grid=True, ylabel='Number of Missing Days', title='Missing Data by Year (1980-2000)')

Pandas has built-in methods for filling in missing data. The simplest is to fill in a constant value, such as the mean, median, or zero.

For example, we can fill in missing precipitation values with the average daily rainfall:

df_20_years['PRCP'].mean()
np.float64(2.4337458854509544)
df_20_years_filled = df_20_years['PRCP'].fillna(df_20_years['PRCP'].mean()) 
df_20_years_filled.groupby(df_20_years_filled.index.year).sum().plot(kind='bar',xlabel='Year', ylabel='Total Precipitation (mm)', title='Total Precipitation by Year (1980-2000) Gapfilled with Mean')   

This did not change the result much. Gapfilling is difficult without knowing the dataset very well. For example, the missing days in 1998 could by chance have included some of the largest rain showers of the year, in which case adding the average daily rainfall makes little difference.

Pandas offers several ways of dealing with missing data.

  • Interpolation: df.interpolate(method = 'linear', ...) fills NaN values using an interpolation method. Several methods are available, including linear, splines, and more.
  • Forward fill: df.ffill() fills NaN values by propagating the last valid observation forward.
  • Backward fill: df.bfill() fills NaN values using the next valid observation.
  • Remove rows with NaN: df.dropna() simply removes any rows that contain NaN values. This is not recommended for time series data.
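These options can be compared side by side on a toy series with gaps (values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

print(s.interpolate(method='linear').tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(s.ffill().tolist())                       # [1.0, 1.0, 3.0, 3.0, 3.0, 6.0]
print(s.bfill().tolist())                       # [1.0, 3.0, 3.0, 6.0, 6.0, 6.0]
print(s.dropna().tolist())                      # [1.0, 3.0, 6.0]
```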

In practice, each of these methods only works well when the gaps in the data are small.