Python for Business Analytics

Working with time stamped data


This guide explains the basics of working with dates and times in Python and pandas.

Content:

This notebook assumes the following imports and settings.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Date and Time in Python

The datetime module from the standard Python library provides the basic variable types and tools for date and time data. To get started, we retrieve the current date and time.

In [2]:
from datetime import datetime
now=datetime.now()
now
Out[2]:
datetime.datetime(2016, 9, 13, 19, 7, 24, 80326)
In [3]:
print(now)
2016-09-13 19:07:24.080326

The now variable that we created has a special data type which stores date and time down to microseconds.

In [4]:
type(now)
Out[4]:
datetime.datetime
In [5]:
now.year, now.month, now.day
Out[5]:
(2016, 9, 13)

An interesting feature of datetime objects is that we can perform operations with them.

In [6]:
delta=datetime(2016, 9, 15)-datetime(2015, 9, 12, 6, 10)
delta
Out[6]:
datetime.timedelta(368, 64200)
In [7]:
delta.days, delta.seconds 
Out[7]:
(368, 64200)

Fow example, if we want to shift a date 5 days ahead, we can use:

In [8]:
from datetime import timedelta
start=datetime(2016, 10, 1)
start+timedelta(5)
Out[8]:
datetime.datetime(2016, 10, 6, 0, 0)

As a note, the datetime module also has separate date and time objects.

Conversions between strings and datetime

Datetime objects (and the pandas counterparts) have the strftime method, which allows us to convert them to a string according to our desired format. Refer to this page for the available formatting options.

In [9]:
stamp=datetime(2016,10,1)
stamp.strftime('%Y-%m-%d')
Out[9]:
'2016-10-01'
In [10]:
stamp.strftime('%d/%b/%y')
Out[10]:
'01/Oct/16'

To convert strings to datatime, we can use the parse function, which infers almost any intelligible date format. Here is an example.

In [11]:
from dateutil.parser import parse
parse('Sept 15th 2016')
Out[11]:
datetime.datetime(2016, 9, 15, 0, 0)

We just need to be careful with the fact that it assumes a US date format, unless we specify otherwise.

In [12]:
print(parse('1/10/2016'))
print(parse('1/10/2016', dayfirst=True))
2016-01-10 00:00:00
2016-10-01 00:00:00

Date functionality in pandas

When dealing with multiple dates, we turn to pandas.

In [13]:
dts=['10/10/2016','11/10/2016']
dates=pd.to_datetime(dts, dayfirst=True)
dates
Out[13]:
DatetimeIndex(['2016-10-10', '2016-10-11'], dtype='datetime64[ns]', freq=None)

In pandas, a set of dates has the DatetimeIndex type. Each element of a DatetimeIndex has the TimeStamp, which for practical purposes is equivalent to datetime; we can use the two interchangeably.

In [14]:
dates[0]
Out[14]:
Timestamp('2016-10-10 00:00:00')

The pandas [date_range](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) function allows us to generate a range of dates according to a specified frequency.

In [15]:
dates=pd.date_range('1/Oct/2016', '5/Oct/2016')
print(dates)
DatetimeIndex(['2016-10-01', '2016-10-02', '2016-10-03', '2016-10-04',
               '2016-10-05'],
              dtype='datetime64[ns]', freq='D')

To set the frequency, we can set the date option. For example, for a business daily frequency:

In [16]:
dates=pd.date_range(start='1/Oct/2016', periods=5, freq='B')
print(dates)
DatetimeIndex(['2016-10-03', '2016-10-04', '2016-10-05', '2016-10-06',
               '2016-10-07'],
              dtype='datetime64[ns]', freq='B')

Monthly frequency:

In [17]:
dates=pd.date_range(start='1/Oct/2016', periods=5, freq='M')
print(dates)
DatetimeIndex(['2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31',
               '2017-02-28'],
              dtype='datetime64[ns]', freq='M')

The frequency option also accepts multiples:

In [18]:
dates=pd.date_range(start='1/Oct/2016', periods=5, freq='3M')
print(dates)
DatetimeIndex(['2016-10-31', '2017-01-31', '2017-04-30', '2017-07-31',
               '2017-10-31'],
              dtype='datetime64[ns]', freq='3M')

Refer to the pandas documentation for a list of available frequencies.

Periods

A related object is the Period, which represents a timespan like months and quarters, rather a point in time as in TimeStamp. For example, the variable below represents the period between 1/10/2016 and 31/10/2016.

In [19]:
month=pd.Period('Oct-2016', freq='M')
month
Out[19]:
Period('2016-10', 'M')

As with datetime and TimeStamp objects, we can perform operations with a Period object.

In [20]:
month+1 # from above
Out[20]:
Period('2016-11', 'M')

The counterpart of the DatetimeIndex is the PeriodIndex.

In [21]:
values=['2016Q1', '2016Q2', '2016Q3', '2016Q4']
index=pd.PeriodIndex(values, freq='Q')
index
Out[21]:
PeriodIndex(['2016Q1', '2016Q2', '2016Q3', '2016Q4'], dtype='int64', freq='Q-DEC')

To generate a period range:

In [22]:
pd.period_range('2016Q1','2016Q4', freq='Q')
Out[22]:
PeriodIndex(['2016Q1', '2016Q2', '2016Q3', '2016Q4'], dtype='int64', freq='Q-DEC')

Finally, we can convert timestamps to periods as follows.

In [23]:
dates=pd.date_range(start='1/Oct/2016', periods=5, freq='M')
print(dates)
dates.to_period()
DatetimeIndex(['2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31',
               '2017-02-28'],
              dtype='datetime64[ns]', freq='M')
Out[23]:
PeriodIndex(['2016-10', '2016-11', '2016-12', '2017-01', '2017-02'], dtype='int64', freq='M')

Reading time stamped data

Let us now work with data. For simplicity, our data will have only one column apart from the date. The same principles applies for working with data frames instead of a single series.

The nswretail.csv file contains monthly retail turnover figures for the state of NSW. I downloaded the data from the Australian Bureau of Statistics website. The ABS explanatory notes define retail turnover as:

Retail sales; wholesale sales; takings from repairs, meals and hiring of goods (except for rent, leasing and hiring of land and buildings); commissions from agency activity (e.g. commissions received from collecting dry cleaning, selling lottery tickets, etc.); and the goods and services tax.

To read the data, we follow the usual procedure. If you open the data file in a text editor, you will see that it has two columns: Month and Turnover. In a time series context, we want to make the date the index of the DataFrame by specifying Month as the index via the index_col option. We set the parse_dates options as True so that pandas can automatically recognise the dates column and convert them to TimeStamp objects.

In [24]:
ts=pd.read_csv('nswretail.csv', index_col='Month', parse_dates=True, dayfirst=True)
ts.tail() # tail gives the last 5 observations in the data
Out[24]:
Turnover
Month
2016-03-01 7804.1
2016-04-01 7659.0
2016-05-01 7749.9
2016-06-01 7708.4
2016-07-01 7742.6

We can see that pandas converted (say) "Jul-2016" in the text file to a full date, which by default is the first day of the month. Since we know these figures refer to the whole month, we want to convert the indexes from timestamps to periods:

In [25]:
ts.index=ts.index.to_period()
ts.tail()
Out[25]:
Turnover
Month
2016-03 7804.1
2016-04 7659.0
2016-05 7749.9
2016-06 7708.4
2016-07 7742.6

Subsetting a time series

Selecting part of the time series works in an intuitive way.

In [26]:
ts['Feb-2016':'Jul-2016']
Out[26]:
Turnover
Month
2016-02 7259.4
2016-03 7804.1
2016-04 7659.0
2016-05 7749.9
2016-06 7708.4
2016-07 7742.6
In [27]:
ts['2016Q2']
Out[27]:
Turnover
Month
2016-04 7659.0
2016-05 7749.9
2016-06 7708.4
In [28]:
ts['Feb2016':]
Out[28]:
Turnover
Month
2016-02 7259.4
2016-03 7804.1
2016-04 7659.0
2016-05 7749.9
2016-06 7708.4
2016-07 7742.6

Time series plot

Once we load the time series, the beginning of our analysis will always be to visualise the data. The simplest way to plot a time series is as follows.

In [29]:
ts['Turnover'].plot(color='red')
plt.xlabel('Year')
plt.ylabel('Turnover')
plt.title('Retail Turnover for NSW (2005-2016)')
plt.show()