Working with timeseries data

import pandas as pd

Reading timeseries data: Weather station data

In this example, we will use NOAA weather station data that I downloaded from Climate Data Online. We are using data from Charlottesville-Albemarle Airport; for this station, daily summaries are archived.

We read the data like in previous examples. With no options, this is what we get.

datafile = r'../Data/CDO_USW00093736_1955_2025.txt'
df = pd.read_csv(datafile)
df.head()
STATION           ELEVATION  LATITUDE   LONGITUDE  DATE     PRCP     SNWD     SNOW     TAVG     TMAX     TMIN     AWND     WDF2     WDF5     WSF2     WSF5
0 ----------------- ---------- ---------- ------...
1 GHCND:USW00093736      194.8      38.14     -7...
2 GHCND:USW00093736      194.8      38.14     -7...
3 GHCND:USW00093736      194.8      38.14     -7...
4 GHCND:USW00093736      194.8      38.14     -7...

Pandas did not separate the data into different columns because it was expecting a standard csv file with comma-separated values. In our file, however, the values are separated by runs of whitespace. We can tell pandas this with the sep keyword: \s matches a whitespace character and + means one or more.

We also see that the row after the header contains no data, only dashes. We use the skiprows keyword to skip this second row (row 1 when counting from zero). By default, pandas assumes that the first row contains the column headers.

df = pd.read_csv(datafile, sep=r'\s+', skiprows=[1])
df.head()
STATION ELEVATION LATITUDE LONGITUDE DATE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
0 GHCND:USW00093736 194.8 38.14 -78.45 19560101 -9999.0 -9999.0 -9999.0 -9999.0 10.0 -1.7 -9999.0 -9999 -9999 -9999.0 -9999.0
1 GHCND:USW00093736 194.8 38.14 -78.45 19560102 -9999.0 -9999.0 -9999.0 -9999.0 4.4 -3.9 -9999.0 -9999 -9999 -9999.0 -9999.0
2 GHCND:USW00093736 194.8 38.14 -78.45 19560103 -9999.0 -9999.0 -9999.0 -9999.0 12.8 0.6 -9999.0 -9999 -9999 -9999.0 -9999.0
3 GHCND:USW00093736 194.8 38.14 -78.45 19560104 -9999.0 -9999.0 -9999.0 -9999.0 8.9 -2.8 -9999.0 -9999 -9999 -9999.0 -9999.0
4 GHCND:USW00093736 194.8 38.14 -78.45 19560105 -9999.0 -9999.0 -9999.0 -9999.0 11.7 0.0 -9999.0 -9999 -9999 -9999.0 -9999.0

This looks much better. We also see that there are lots of -9999 and -9999.0 values. The CDO documentation tells us that these are used to identify missing data, so we can let pandas know.

df = pd.read_csv(datafile, sep=r'\s+', skiprows=[1], na_values=[-9999, -9999.0])
df.head()
STATION ELEVATION LATITUDE LONGITUDE DATE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
0 GHCND:USW00093736 194.8 38.14 -78.45 19560101 NaN NaN NaN NaN 10.0 -1.7 NaN NaN NaN NaN NaN
1 GHCND:USW00093736 194.8 38.14 -78.45 19560102 NaN NaN NaN NaN 4.4 -3.9 NaN NaN NaN NaN NaN
2 GHCND:USW00093736 194.8 38.14 -78.45 19560103 NaN NaN NaN NaN 12.8 0.6 NaN NaN NaN NaN NaN
3 GHCND:USW00093736 194.8 38.14 -78.45 19560104 NaN NaN NaN NaN 8.9 -2.8 NaN NaN NaN NaN NaN
4 GHCND:USW00093736 194.8 38.14 -78.45 19560105 NaN NaN NaN NaN 11.7 0.0 NaN NaN NaN NaN NaN

Great. The missing data is now represented by NaN.
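A quick way to verify this is to count the missing values per column with .isna().sum(). The sketch below uses a small made-up frame (column names chosen to match, values invented), not the actual CDO file:

```python
import numpy as np
import pandas as pd

# A tiny made-up frame with the same kind of gaps (not the real CDO data)
toy = pd.DataFrame({
    'TMAX': [10.0, 4.4, np.nan],
    'PRCP': [np.nan, np.nan, 1.2],
})

# isna() marks missing entries; summing counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column['TMAX'])  # 1
print(missing_per_column['PRCP'])  # 2
```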

What data types did pandas infer?

df.info()
<class 'pandas.DataFrame'>
RangeIndex: 25554 entries, 0 to 25553
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   STATION    25554 non-null  str    
 1   ELEVATION  25554 non-null  float64
 2   LATITUDE   25554 non-null  float64
 3   LONGITUDE  25554 non-null  float64
 4   DATE       25554 non-null  int64  
 5   PRCP       23343 non-null  float64
 6   SNWD       5958 non-null   float64
 7   SNOW       5970 non-null   float64
 8   TAVG       2831 non-null   float64
 9   TMAX       25498 non-null  float64
 10  TMIN       25496 non-null  float64
 11  AWND       9914 non-null   float64
 12  WDF2       9919 non-null   float64
 13  WDF5       9899 non-null   float64
 14  WSF2       9920 non-null   float64
 15  WSF5       9901 non-null   float64
dtypes: float64(14), int64(1), str(1)
memory usage: 3.5 MB

Do you notice a problem? The DATE column was read as an integer, which means that pandas did not recognize it as a date.

So, January 1st, 1956 became 19560101.
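As an aside, a column that was already read as integers in YYYYMMDD form can also be converted after the fact with pd.to_datetime and an explicit format string. A minimal sketch with made-up values:

```python
import pandas as pd

# Integer dates in YYYYMMDD form, as pandas initially read them
raw = pd.Series([19560101, 19560102])

# format='%Y%m%d' tells pandas how to parse each value
converted = pd.to_datetime(raw, format='%Y%m%d')
print(converted[0])  # 1956-01-01 00:00:00
```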

We need to explicitly tell pandas which columns should be parsed as dates:

df = pd.read_csv(datafile, sep=r'\s+',
                 na_values=[-9999, -9999.0],
                 skiprows=[1],
                 parse_dates=[4])
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 25554 entries, 0 to 25553
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   STATION    25554 non-null  str           
 1   ELEVATION  25554 non-null  float64       
 2   LATITUDE   25554 non-null  float64       
 3   LONGITUDE  25554 non-null  float64       
 4   DATE       25554 non-null  datetime64[us]
 5   PRCP       23343 non-null  float64       
 6   SNWD       5958 non-null   float64       
 7   SNOW       5970 non-null   float64       
 8   TAVG       2831 non-null   float64       
 9   TMAX       25498 non-null  float64       
 10  TMIN       25496 non-null  float64       
 11  AWND       9914 non-null   float64       
 12  WDF2       9919 non-null   float64       
 13  WDF5       9899 non-null   float64       
 14  WSF2       9920 non-null   float64       
 15  WSF5       9901 non-null   float64       
dtypes: datetime64[us](1), float64(14), str(1)
memory usage: 3.5 MB

That worked.

Time indices

Indexes are very powerful, and they are a big part of why pandas is so useful. There are different index types for different kinds of data, and the DatetimeIndex for time series is especially useful.

We now set the date column as the index for our dataframe.

df = df.set_index('DATE')
df.head()
STATION ELEVATION LATITUDE LONGITUDE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
DATE
1956-01-01 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 10.0 -1.7 NaN NaN NaN NaN NaN
1956-01-02 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 4.4 -3.9 NaN NaN NaN NaN NaN
1956-01-03 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 12.8 0.6 NaN NaN NaN NaN NaN
1956-01-04 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 8.9 -2.8 NaN NaN NaN NaN NaN
1956-01-05 GHCND:USW00093736 194.8 38.14 -78.45 NaN NaN NaN NaN 11.7 0.0 NaN NaN NaN NaN NaN

The DatetimeIndex object has lots of useful attributes. Below are some examples:

df.index.month
Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       ...
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      dtype='int32', name='DATE', length=25554)
df.index.year
Index([1956, 1956, 1956, 1956, 1956, 1956, 1956, 1956, 1956, 1956,
       ...
       2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026],
      dtype='int32', name='DATE', length=25554)
df.index.day_of_week
Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
       ...
       2, 3, 4, 5, 6, 0, 1, 2, 3, 4],
      dtype='int32', name='DATE', length=25554)

We can also use the time index to calculate the length of a period.

df.index[10] - df.index[0] 
Timedelta('10 days 00:00:00')
(df.index[10] - df.index[0]).total_seconds() 
864000.0
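Subtracting two index entries yields a Timedelta, which has useful attributes of its own. A small sketch on a synthetic daily index standing in for the station data:

```python
import pandas as pd

# Synthetic daily index (stand-in for the station data above)
idx = pd.date_range('1956-01-01', periods=11, freq='D')

span = idx[10] - idx[0]
print(span.days)             # 10
print(span.total_seconds())  # 864000.0
```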

Accessing data through the time index

We can now access values by time:

df.loc['2017-08-07']
STATION      GHCND:USW00093736
ELEVATION                192.3
LATITUDE               38.1374
LONGITUDE            -78.45513
PRCP                      26.7
SNWD                       NaN
SNOW                       NaN
TAVG                       NaN
TMAX                      25.6
TMIN                      21.1
AWND                       1.5
WDF2                      30.0
WDF5                     320.0
WSF2                       5.8
WSF5                       7.2
Name: 2017-08-07 00:00:00, dtype: object

Or use slicing to get a range:

df.loc['2017-07-01':'2017-07-31']
STATION ELEVATION LATITUDE LONGITUDE PRCP SNWD SNOW TAVG TMAX TMIN AWND WDF2 WDF5 WSF2 WSF5
DATE
2017-07-01 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 34.4 23.3 2.7 190.0 120.0 6.7 8.9
2017-07-02 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.7 20.6 1.5 200.0 210.0 5.8 8.1
2017-07-03 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.9 19.4 0.9 210.0 210.0 5.4 6.3
2017-07-04 GHCND:USW00093736 192.3 38.1374 -78.45513 0.3 NaN NaN NaN 32.2 21.1 0.8 210.0 200.0 4.0 5.4
2017-07-05 GHCND:USW00093736 192.3 38.1374 -78.45513 5.3 NaN NaN NaN 27.8 22.2 2.0 80.0 60.0 6.3 8.1
2017-07-06 GHCND:USW00093736 192.3 38.1374 -78.45513 0.5 NaN NaN NaN 31.7 21.7 1.3 210.0 60.0 5.8 6.7
2017-07-07 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.3 22.8 1.7 200.0 210.0 5.8 8.5
2017-07-08 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 32.8 20.0 1.6 320.0 330.0 5.8 8.1
2017-07-09 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.1 17.2 0.9 160.0 160.0 5.4 6.3
2017-07-10 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.9 17.2 2.0 190.0 190.0 6.7 9.4
2017-07-11 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.1 20.6 2.2 200.0 190.0 5.8 7.2
2017-07-12 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 36.1 22.8 1.6 260.0 260.0 5.4 6.7
2017-07-13 GHCND:USW00093736 192.3 38.1374 -78.45513 0.5 NaN NaN NaN 36.7 22.8 2.0 230.0 210.0 7.2 9.4
2017-07-14 GHCND:USW00093736 192.3 38.1374 -78.45513 22.6 NaN NaN NaN 37.8 23.9 2.3 340.0 320.0 10.3 17.0
2017-07-15 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 33.3 22.8 1.6 340.0 340.0 4.5 6.7
2017-07-16 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 32.8 20.6 1.5 120.0 140.0 4.0 6.3
2017-07-17 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 32.8 20.6 1.4 170.0 180.0 7.6 9.4
2017-07-18 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 35.0 20.0 1.5 150.0 150.0 7.6 10.3
2017-07-19 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 35.6 18.9 0.9 100.0 350.0 4.0 5.4
2017-07-20 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 36.7 21.7 1.2 190.0 290.0 5.8 8.9
2017-07-21 GHCND:USW00093736 192.3 38.1374 -78.45513 5.8 NaN NaN NaN 36.7 23.3 1.1 320.0 310.0 8.9 14.8
2017-07-22 GHCND:USW00093736 192.3 38.1374 -78.45513 1.3 NaN NaN NaN 36.7 22.8 2.1 50.0 350.0 7.6 10.3
2017-07-23 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 34.4 23.3 1.5 200.0 190.0 9.4 13.0
2017-07-24 GHCND:USW00093736 192.3 38.1374 -78.45513 9.4 NaN NaN NaN 33.3 22.8 2.6 190.0 160.0 11.6 14.3
2017-07-25 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 30.0 18.9 2.7 20.0 40.0 6.7 8.5
2017-07-26 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 30.0 16.7 1.0 200.0 180.0 5.4 7.6
2017-07-27 GHCND:USW00093736 192.3 38.1374 -78.45513 0.5 NaN NaN NaN 31.1 20.0 1.4 210.0 190.0 5.4 6.7
2017-07-28 GHCND:USW00093736 192.3 38.1374 -78.45513 51.1 NaN NaN NaN 30.0 22.2 1.2 210.0 200.0 5.4 6.3
2017-07-29 GHCND:USW00093736 192.3 38.1374 -78.45513 0.3 NaN NaN NaN 26.7 17.2 3.8 40.0 20.0 10.3 12.5
2017-07-30 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 28.9 13.3 1.3 30.0 50.0 5.4 6.7
2017-07-31 GHCND:USW00093736 192.3 38.1374 -78.45513 0.0 NaN NaN NaN 31.1 15.0 0.8 70.0 90.0 4.5 5.8
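Partial-string indexing also works: pandas expands a string like '2017-07' to cover the whole month. A sketch on a synthetic daily series (not the station data):

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering one year
idx = pd.date_range('2017-01-01', '2017-12-31', freq='D')
s = pd.Series(np.arange(len(idx)), index=idx)

july = s.loc['2017-07']  # all of July 2017
year = s.loc['2017']     # the whole year
print(len(july), len(year))  # 31 365
```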

Plotting data

Pandas is very “time aware” and will format the axis of our plot accordingly.

df.loc['2017-01-01':'2017-12-31', 'TMAX'].plot()

Aggregating data

This timeseries has daily resolution, and the daily plots can be very noisy.

df.plot(y='TMAX')   

Aggregating our data will help us reduce the noise.

Resampling data

A common operation is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample method. The resampling periods are specified using pandas offset aliases.

Below we resample the dataset to yearly resolution, taking the mean within each year. YS stands for year start. A full list of these so-called offset aliases is found in the pandas timeseries documentation.

yearly_df = df[['TMIN', 'TMAX']].resample('YS').mean()
yearly_df.head()
TMIN TMAX
DATE
1956-01-01 9.924501 19.512821
1957-01-01 10.489779 19.036464
1958-01-01 8.379614 17.766391
1959-01-01 9.181096 19.614247
1960-01-01 8.561749 18.340984
yearly_df.plot()

This looks much less noisy, and we could for example now think about whether there is a trend in our dataset.
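The same pattern works with any offset alias, for example 'MS' for month start. A sketch on a synthetic daily series (values are made up):

```python
import pandas as pd

# Synthetic daily series over three months, 1.0 per day
idx = pd.date_range('2000-01-01', '2000-03-31', freq='D')
s = pd.Series(1.0, index=idx)

# Monthly totals: 'MS' labels each bin by the start of the month
monthly = s.resample('MS').sum()
print(monthly.tolist())  # [31.0, 29.0, 31.0]  (2000 is a leap year)
```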

Groupby

We have already seen how powerful the .groupby() operation is.

A common way to analyze weather data in climate science is to create a “climatology,” which contains the average values in each month or day of the year. We can do this easily with .groupby(). Recall that df.index is a pandas DateTimeIndex object.

monthly_climatology  = df[['TMIN', 'TMAX']].groupby(df.index.month).mean()
monthly_climatology 
TMIN TMAX
DATE
1 -2.707156 7.235597
2 -1.510136 9.175895
3 2.457680 14.229545
4 7.643440 20.404322
5 12.485556 24.526753
6 17.079532 28.774678
7 19.510906 30.701017
8 18.657407 29.771930
9 14.877751 26.253037
10 8.311281 20.599676
11 3.371483 14.821412
12 -0.699630 9.224699

Each row in this new dataframe represents the average value for one month of the year (1=January, 2=February, etc.).

We can apply more customized aggregations, specified with .aggregate(). Below we keep the max of the TMAX column and the min of the TMIN column.

monthly_T_climatology = df.groupby(df.index.month).aggregate({'TMAX': 'max',
                                                              'TMIN': 'min'})
monthly_T_climatology.head()
TMAX TMIN
DATE
1 27.2 -21.1
2 27.8 -19.4
3 32.8 -17.1
4 34.4 -5.6
5 36.1 -1.7
monthly_T_climatology.plot(marker='o')

If we want a finer scale, we can group by day of year:

daily_T_climatology = df.groupby(df.index.dayofyear).aggregate({'TMAX': 'max',
                                                              'TMIN': 'min'})
daily_T_climatology.plot(marker='.')

Missing data and gapfilling

Because timeseries data is ordered and can show periodic variation (e.g. daily, seasonal, annual), dealing with missing data is very important.

Take precipitation data for example.

df['PRCP'].plot(kind='hist',ylabel='Frequency', xlabel='Precipitation (mm)', bins=20, grid = True)

Because rainfall is intermittent and the majority of days are dry, rainfall is typically reported as monthly or annual totals. Let's look at the period between 1980 and 2000.

df_20_years = df.loc['1980':'2000']
df_20_years.groupby(df_20_years.index.year)['PRCP'].sum().plot(kind='bar',xlabel='Year', ylabel='Total Precipitation (mm)', title='Total Precipitation by Year (1980-2000)')   

Using the data as is, it appears that 1988 and 1994 were unusually dry.

We can check the same period for missing data. The .isna() method returns True where data is missing (i.e. NaN). We can also exploit the fact that True counts as 1 and False as 0.

So the code below produces a series with True for missing days and False for days with data, and then sums up the Trues for each year.
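The True-counts-as-1 trick on its own looks like this (toy series, not the station data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

mask = s.isna()
print(mask.tolist())  # [False, True, True, False]
print(mask.sum())     # 2 -- True counts as 1, False as 0
```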

df_20_years['PRCP'].isna().groupby(df_20_years.index.year).sum().plot(grid=True, ylabel='Number of Missing Days', title='Missing Data by Year (1980-2000)')

Pandas has built-in methods for filling in missing data. The simplest is to fill in a constant value, such as the mean, median, or zero.

For example, we can fill in missing precipitation values with the average daily rainfall:

df_20_years['PRCP'].mean()
np.float64(2.4337458854509544)
df_20_years_filled = df_20_years['PRCP'].fillna(df_20_years['PRCP'].mean()) 
df_20_years_filled.groupby(df_20_years_filled.index.year).sum().plot(kind='bar',xlabel='Year', ylabel='Total Precipitation (mm)', title='Total Precipitation by Year (1980-2000) Gapfilled with Mean')   

This did not change the result much. Gapfilling is difficult without knowing the dataset very well. For example, the missing days in 1998 could by chance have included some of the largest rain showers of the year, in which case adding the average daily rainfall makes little difference.

Pandas offers several ways of dealing with missing data.

  • Interpolation: df.interpolate(method = 'linear', ...) fills NaN values using an interpolation method. Several methods are available, including linear, splines, and more.
  • Forward fill: df.ffill() fills NaN values by propagating the last valid observation forward.
  • Backward fill: df.bfill() fills NaN values using the next valid observation.
  • Remove rows with NaN: df.dropna() simply removes any rows that contain NaN values. This is not recommended for time series data.
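These options can be compared side by side on a toy series with gaps (values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

print(s.interpolate(method='linear').tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(s.ffill().tolist())                       # [1.0, 1.0, 3.0, 3.0, 3.0, 6.0]
print(s.bfill().tolist())                       # [1.0, 3.0, 3.0, 6.0, 6.0, 6.0]
print(s.dropna().tolist())                      # [1.0, 3.0, 6.0]
```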

In practice, each of these methods only works well when the gaps in the data are small.