Python for Business Analytics

Working with data in Python


The first step in any data project is to get the data ready for analysis. In a traditional business statistics curriculum, the dataset was usually taken for granted and ready for analysis from the start. However, this no longer reflects the reality in business and industry: the trend towards big data brought with it an increased demand for professionals with the computational skills to work possibly complex and unstructured data in sophisticated and productive way. It is commonly stated that 80% of a data scientist's job tends to be data preparation, exploratory data analysis (EDA), and visualisation.

In data science terminology the process of preparing data is called data wrangling or munging. Common tasks include importing data, merging datasets, identifying and handling errors (data cleaning), dealing with missing values, investigating outliers, transforming variable and creating new ones, etc.

In this lesson we explore the basics of how to use the pandas package to work efficiently with data in Python. As a complement to this guide, you can consult the 10 minutes to pandas tutorial in the official pandas documentation, which is a useful starting reference.

Content:

Dataset

In this lesson we will use the Credit dataset (please download from this link) taken from the Introduction to Statistical Learning texbook by James, Witten, Hastie and Tibshirani. The dataset records the average credit card balace at end of the month for customers of a financial services company, as well as other individual characteristics such age, education, gender, marital status, number of cards, and credit rating.

Importing and viewing data

We start by loading the data. First, we need to load the pandas package. Since the dataset is in the csv format, we use the read_csv function to import the file. The package will automatically read the column names and infer the variable types. We assign the data variable to a variable called data, which will store the dataset as an object called a DataFrame.

In [1]:
import pandas as pd
data=pd.read_csv('credit.csv')

To get help on any function or object, append a question mark to it and run the cell.

In [2]:
#pd.read_csv? # data?, data.head?, etc (the hash starts a comment, everything after it is ignored when running the cell)

A pandas DataFrame has some similarities to a spreadsheet. However, unlike a spreadsheet, you are not able to click through it or manually make modifications in the current setup. We do everything through coding. This may initially feel restrictive, but it is ultimately more efficient and scalable to work in this way. Some Python environments such as Spyder have GUIs (graphical user interfaces) for viewing data frames.

Two basic methods to have a first view of the data are the head and tail methods. The head method displays the first rows of the data (by default five), while the tail displays the last rows. You can specify the number of rows within the parentheses.

Running a cell with only the name of the DataFrame will provide a full view, but pandas limits number of rows that can be shown. See here if you want to change this setting.

Note that only the last line of a Jupyter cell will generate an output on the screen. We can use the print function from standard Python library to output multiple variables (see the next section for some examples). However, print does not include the DataFrame table formatting that you can see below.

In [3]:
data.head()
Out[3]:
Obs Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
0 1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
1 2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
2 3 104.593 7075 514 4 71 11 Male No No Asian 580
3 4 148.924 9504 681 3 36 11 Female No No Asian 964
4 5 55.882 4897 357 2 68 16 Male No Yes Caucasian 331
In [4]:
data.tail()
Out[4]:
Obs Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
395 396 12.096 4100 307 3 32 13 Male No Yes Caucasian 560
396 397 13.364 3838 296 5 65 17 Male No No African American 480
397 398 57.872 4171 321 5 67 12 Female No Yes Caucasian 138
398 399 37.728 2525 192 1 44 13 Male No Yes Caucasian 0
399 400 18.701 5524 415 5 64 7 Female No No Asian 966

The rows of our DataFrame have a numerical index (in bold above), which is the default behaviour. An important detail, if you are not used to Python or some other programming languages, is that the index starts from zero instead of one. Numerical indexes start from zero in Python. This does not need to be the case in the DataFrame, but pandas follows the Python convention by default.

Alternatively, we can specify the DataFrame index (that is, a label for each row), which does not need to be a number. For example, if you have time series data, it can be the date. See the practice section below for another example.

In our case we can see that the first column is an observation index, so that we could specify that this is the case when reading the data.

In [5]:
data=pd.read_csv('credit.csv', index_col='Obs')
data.head()
Out[5]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
Obs
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
3 104.593 7075 514 4 71 11 Male No No Asian 580
4 148.924 9504 681 3 36 11 Female No No Asian 964
5 55.882 4897 357 2 68 16 Male No Yes Caucasian 331

Alternatively:

In [6]:
data=pd.read_csv('credit.csv')
data.set_index('Obs', inplace=True)

Pandas has specialised functions for reading other types of input, such as Excel files. You can see a list of available functions here. The pandas read table function reads data stored as general delimited text files (for example, where the columns are separated by space rather than commas). In practical business situations, you may often need to obtain data from a relational database rather thah having to load a flat file stored in your computer. You can read database queries and tables input into a DataFrame by using use the read_sql_table, read_sql_query, or read_sql functions.

Our dataset here is simple to work with, but others may require customising the function call. Refer to the documentation for finding the appropriate options for other data that you come across.

Data selection

There are two ways to select data in pandas: by providing the column and index labels or by providing a numerical index.

Selecting a column by label

The output will now look different because selecting only one column returns a Series (a specialised object for when there is only one column of data) rather than a DataFrame.

In [7]:
data['Income'].head(2)
Out[7]:
Obs
1     14.891
2    106.025
Name: Income, dtype: float64

Selecting multiple columns by label

In [8]:
data[['Income','Education']].head(2)
Out[8]:
Income Education
Obs
1 14.891 11
2 106.025 15

Here, the inner brackets is to indicate that we are passing a list of column names. The example will make this clear, and is a useful template for some of what we will do.

In [9]:
names=['Income','Education']
print(type(names))
print(names)
data[names].head(2)
<class 'list'>
['Income', 'Education']
Out[9]:
Income Education
Obs
1 14.891 11
2 106.025 15

Selecting a column by a numerical index

The iloc method allows us to select data by numerical indexes. We just have to be careful not be confused by zero indexing. If want the first column then, the index needs to be zero. The following is equivalent to what we did in above.

In [10]:
data.iloc[:,0].head(2)
Out[10]:
Obs
1     14.891
2    106.025
Name: Income, dtype: float64

The : notation indicates we want to include all rows in the selection.

Selecting multiple columns by numerical indexes

Here, we pass a list of column numbers for indexing.

In [11]:
data.iloc[:,[0,5]].head(2)
Out[11]:
Income Education
Obs
1 14.891 11
2 106.025 15

Another method is slicing. Suppose that we want to select the data from the 1st to the 6th column. When specifying a range of integer indexes, the last one does not count. This may be initially confusing, but is the standard Python syntax. What the cell below does is to request indexes 0, 1, 2, 3, 4, 5, which correspond to columns 1-6.

In [12]:
data.iloc[:,0:6].head(2)
Out[12]:
Income Limit Rating Cards Age Education
Obs
1 14.891 3606 283 2 34 11
2 106.025 6645 483 3 82 15

Selecting rows by labels

The loc method alows to select rows by the designated index labels (Obs).

In [13]:
data.loc[[1,2,5],:]
Out[13]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
Obs
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
5 55.882 4897 357 2 68 16 Male No Yes Caucasian 331
In [14]:
data.loc[1:2,:]
Out[14]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
Obs
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903

Selecting rows by numerical index

This is useful when the index variable is a string or date. Here we are back to zero indexing.

In [15]:
data.iloc[0:2,:]
Out[15]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
Obs
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903

Jointly selecting rows and columns

We can combine the previous examples to simultaneously select specific rows and columns.

In [16]:
data.loc[1:2,['Income', 'Education']]
Out[16]:
Income Education
Obs
1 14.891 11
2 106.025 15
In [17]:
data.iloc[0:2,[0,5]]
Out[17]:
Income Education
Obs
1 14.891 11
2 106.025 15

Mixing labels and numerical indexes

As a more advanced concept, a slice of a DataFrame is itself a pandas object (a DataFrame if the slice has multiple columns, or a Series if it has only one). That means that we can chain operations when writing.

In [18]:
data[['Income', 'Education']].iloc[0:2,:]
Out[18]:
Income Education
Obs
1 14.891 11
2 106.025 15

Conditional selection

Suppose that we want to know the sample average credit card balance only for males. Below, we select the balance column and the rows such that the value of the gender column is male.

In [19]:
data.loc[data['Gender']==' Male','Balance'].mean()
Out[19]:
509.8031088082902

This is called boolean indexing in Python, because it involves the creation of binary variables indicating whether the condition is true of false for each row. The next cell will help you to understand this.

In [20]:
print(data['Gender'].head(2))
print(data['Gender'].head(2)==' Male')
Obs
1      Male
2    Female
Name: Gender, dtype: object
Obs
1     True
2    False
Name: Gender, dtype: bool

You can also specify multiple conditions. The following selects males with age equal or lower than 30. You can look at this reference for a list of Python comparison operators.

In [21]:
data.loc[(data['Gender']==' Male') & (data['Age']<=30),'Balance'].mean()
Out[21]:
516.3

Assigning new values to a data selection

You may have noted that the first character in "Male" above is a space. This is an unintentional error in the data. The unique method in pandas allows us to view all the unique values in a column. In this case, it confirms that all entries are like this.

In [22]:
data['Gender'].unique()
Out[22]:
array([' Male', 'Female'], dtype=object)

We can use our data selection knowledge to fix this. Below, we joinly select the rows in which the gender is " Male" and the Gender column. We then replace the values in those locations with the correct label.

In [23]:
data.loc[data['Gender']==' Male','Gender']='Male'
data.head(2)
Out[23]:
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
Obs
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903

Exporting data

Once you have made the necessary modications to the dataset, you may want to save it to continue working on it later. More generally, you may wish to save the results of your analysis or export tables so that they insert them in a report or webpage (after formatting). Pandas has methods to export data as csv and Excel files, LaTex and HTML tables, among other options.

In the following example I export our DataFrame as an Excel file. You can run the cell and try to open the file in Excel to check that it worked.

In [24]:
data.to_excel('new_data.xlsx')

Descriptive Statistics

After loading and preparing the data, we can start exploring it with basic descriptive statitistics. The describe method provides a table with basic summary statitics for the data.

In [25]:
data.describe().round(1) # here I appended round to limit the number of decimal places in the display, try without it
Out[25]:
Income Limit Rating Cards Age Education Balance
count 400.0 400.0 400.0 400.0 400.0 400.0 400.0
mean 45.2 4735.6 354.9 3.0 55.7 13.4 520.0
std 35.2 2308.2 154.7 1.4 17.2 3.1 459.8
min 10.4 855.0 93.0 1.0 23.0 5.0 0.0
25% 21.0 3088.0 247.2 2.0 41.8 11.0 68.8
50% 33.1 4622.5 344.0 3.0 56.0 14.0 459.5
75% 57.5 5872.8 437.2 4.0 70.0 16.0 863.0
max 186.6 13913.0 982.0 9.0 98.0 20.0 1999.0

There are also individual functions for a range of summary statistics. Refer to the pandas documentation for a full list of available functions. As examples, we calculate the means of the dataset and the correlation between income and credit card limit.

In [26]:
data.mean().round(2)
Out[26]:
Income         45.22
Limit        4735.60
Rating        354.94
Cards           2.96
Age            55.67
Education      13.45
Balance       520.02
dtype: float64
In [27]:
data[['Income','Limit']].corr().round(2)
Out[27]:
Income Limit
Income 1.00 0.79
Limit 0.79 1.00

Data column types

When preparing complex datasets for analysis, it is often useful to work with different types of variables (say, numerical or categorical) separately. You can use the dataframe info method to view the list of columns and their variable types, which pandas calls dtypes. The object dtype is the most general type: a column with this dtype will typically contain text data.

In [28]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 1 to 400
Data columns (total 11 columns):
Income       400 non-null float64
Limit        400 non-null int64
Rating       400 non-null int64
Cards        400 non-null int64
Age          400 non-null int64
Education    400 non-null int64
Gender       400 non-null object
Student      400 non-null object
Married      400 non-null object
Ethnicity    400 non-null object
Balance      400 non-null int64
dtypes: float64(1), int64(6), object(4)
memory usage: 57.5+ KB

To select columns of a particular type, you can do as follows:

In [29]:
text_variables=data.select_dtypes(['object'])
text_variables.head()
Out[29]:
Gender Student Married Ethnicity
Obs
1 Male No Yes Caucasian
2 Female Yes Yes Asian
3 Male No No Asian
4 Female No No Asian
5 Male No Yes Caucasian

Other data skills

This lesson covered the basic steps to get started with data wrangling in Python. Here are some other useful skills that we address in our units.

  • Data transformation and creating new variables.
  • Handling missing data and errors.
  • Combining and merging datasets.
  • Data aggregation and group operations.
  • Working with time stamped data.

If you aim to work as a data scientist, you should aim to build the following additional skills beyond this unit (among others).

  • Processing text data.
  • Reading and writing JSON files.
  • Extracting data from XML and HTML content.
  • Interacting with Web APIs.
  • Interacting with databases.

Practice

Download the companion notebook for exercises.