The first step in any data project is to get the data ready for analysis. In a traditional business statistics curriculum, the dataset was usually taken for granted and ready for analysis from the start. However, this no longer reflects reality in business and industry: the trend towards big data has brought with it an increased demand for professionals with the computational skills to work with possibly complex and unstructured data in a sophisticated and productive way. It is commonly stated that around 80% of a data scientist's job consists of data preparation, exploratory data analysis (EDA), and visualisation.
In data science terminology the process of preparing data is called data wrangling or munging. Common tasks include importing data, merging datasets, identifying and handling errors (data cleaning), dealing with missing values, investigating outliers, transforming variables and creating new ones, etc.
In this lesson we explore the basics of how to use the pandas package to work efficiently with data in Python. As a complement to this guide, you can consult the 10 minutes to pandas tutorial in the official pandas documentation, which is a useful starting reference.
In this lesson we will use the Credit dataset (please download from this link) taken from the Introduction to Statistical Learning textbook by James, Witten, Hastie and Tibshirani. The dataset records the average credit card balance at the end of the month for customers of a financial services company, as well as other individual characteristics such as age, education, gender, marital status, number of cards, and credit rating.
We start by loading the data. First, we need to load the pandas package. Since the dataset is in the csv format, we use the read_csv function to import the file. The package will automatically read the column names and infer the variable types. We assign the result to a variable called data, which will store the dataset as an object called a DataFrame.
import pandas as pd
data = pd.read_csv('credit.csv')
To get help on any function or object, append a question mark to it and run the cell.
#pd.read_csv? # data?, data.head?, etc (the hash starts a comment, everything after it is ignored when running the cell)
A pandas DataFrame has some similarities to a spreadsheet. However, unlike a spreadsheet, you are not able to click through it or manually make modifications in the current setup. We do everything through coding. This may initially feel restrictive, but it is ultimately more efficient and scalable to work in this way. Some Python environments such as Spyder have GUIs (graphical user interfaces) for viewing data frames.
Two basic methods to have a first view of the data are the head and tail methods. The head method displays the first rows of the data (by default five), while the tail displays the last rows. You can specify the number of rows within the parentheses.
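To illustrate head and tail without the Credit file at hand, here is a minimal sketch using a small synthetic DataFrame (the column names and values are made up for illustration; the real lesson uses credit.csv):

```python
import pandas as pd

# Synthetic stand-in for the Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025, 104.593, 148.924],
                     'Age': [34, 82, 71, 36]})

print(data.head(2))   # first two rows
print(data.tail(2))   # last two rows
```

Passing a number inside the parentheses overrides the default of five rows.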
Running a cell with only the name of the DataFrame will provide a full view, but pandas limits number of rows that can be shown. See here if you want to change this setting.
Note that only the last line of a Jupyter cell will generate an output on the screen. We can use the print function from standard Python library to output multiple variables (see the next section for some examples). However, print does not include the DataFrame table formatting that you can see below.
The rows of our DataFrame have a numerical index (in bold above), which is the default behaviour. An important detail, if you are not used to Python or similar programming languages, is that numerical indexes in Python start from zero rather than one. The DataFrame index itself does not have to follow this convention, but pandas adopts it by default.
Alternatively, we can specify the DataFrame index (that is, a label for each row), which does not need to be a number. For example, if you have time series data, it can be the date. See the practice section below for another example.
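As a small hedged sketch of a non-numeric index, the hypothetical time series below uses dates as row labels (the sales figures are invented for illustration):

```python
import pandas as pd

# Hypothetical monthly sales series: dates serve as the DataFrame index
sales = pd.DataFrame({'Sales': [120, 95, 143]},
                     index=pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31']))

# Rows can now be selected by date label rather than by position
print(sales.loc['2020-02-29'])
```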
In our case we can see that the first column is an observation index, so we can tell pandas to treat it as the index when reading the data.
data = pd.read_csv('credit.csv', index_col='Obs')
data.head()
data = pd.read_csv('credit.csv')
data.set_index('Obs', inplace=True)
Pandas has specialised functions for reading other types of input, such as Excel files. You can see a list of available functions here. The pandas read_table function reads data stored as general delimited text files (for example, where the columns are separated by spaces rather than commas). In practical business situations, you may often need to obtain data from a relational database rather than loading a flat file stored on your computer. You can read database queries and tables into a DataFrame by using the read_sql_table, read_sql_query, or read_sql functions.
Our dataset here is simple to work with, but others may require customising the function call. Refer to the documentation for finding the appropriate options for other data that you come across.
There are two ways to select data in pandas: by providing the column and index labels or by providing a numerical index.
The output will now look different because selecting only one column returns a Series (a specialised object for when there is only one column of data) rather than a DataFrame.
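The distinction can be checked directly: a single column name in brackets yields a Series, while a list of column names (even a list of length one) yields a DataFrame. A minimal sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in for the Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025], 'Age': [34, 82]},
                    index=pd.Index([1, 2], name='Obs'))

single = data['Income']        # one column name: returns a Series
subset = data[['Income']]      # list of column names: returns a DataFrame
print(type(single).__name__)
print(type(subset).__name__)
```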
Obs
1     14.891
2    106.025
Name: Income, dtype: float64
Here, the inner brackets indicate that we are passing a list of column names. The example below makes this clear, and is a useful template for some of what we will do.
names = ['Income', 'Education']
print(type(names))
print(names)
data[names].head(2)
<class 'list'>
['Income', 'Education']
The iloc method allows us to select data by numerical indexes. We just have to be careful not to be confused by zero indexing: if we want the first column, the index needs to be zero. The following is equivalent to what we did above.
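A minimal sketch of iloc on synthetic data (stand-in values, not the real Credit file):

```python
import pandas as pd

# Synthetic stand-in for the Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025], 'Age': [34, 82]},
                    index=pd.Index([1, 2], name='Obs'))

# iloc uses zero-based positions: the first column has position 0
first_col = data.iloc[:, 0]
print(first_col)
```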
Obs
1     14.891
2    106.025
Name: Income, dtype: float64
The : notation indicates we want to include all rows in the selection.
Here, we pass a list of column numbers for indexing.
Another method is slicing. Suppose that we want to select the data from the 1st to the 6th column. When specifying a range of integer indexes, the end of the range is exclusive. This may be confusing at first, but it is standard Python syntax. The cell below requests indexes 0, 1, 2, 3, 4, 5, which correspond to columns 1-6.
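A small sketch of the exclusive end point, using a synthetic frame with seven columns (the column names here are placeholders):

```python
import pandas as pd

# Synthetic frame with seven columns to illustrate slicing (not the real data)
data = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7]],
                    columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

# Positions 0 to 5 (the end point 6 is excluded), i.e. columns 1-6
subset = data.iloc[:, 0:6]
print(list(subset.columns))
```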
This is useful when the index variable is a string or date. Here we are back to zero indexing.
We can combine the previous examples to simultaneously select specific rows and columns.
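A minimal sketch of combined row and column selection, again on a synthetic stand-in for the Credit data:

```python
import pandas as pd

# Synthetic stand-in for the Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025, 104.593],
                     'Age': [34, 82, 71]},
                    index=pd.Index([1, 2, 3], name='Obs'))

# First two rows of the first column, by position (end point exclusive)
print(data.iloc[0:2, 0])
# The same selection by label (label slices include the end point)
print(data.loc[1:2, 'Income'])
```

Note the asymmetry: iloc slices exclude the end position, while loc slices by label include it.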
As a more advanced concept, a slice of a DataFrame is itself a pandas object (a DataFrame if the slice has multiple columns, or a Series if it has only one). That means that we can chain operations when writing.
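Because a selection is itself a pandas object, further methods can be called on it directly. A hedged sketch of such chaining (synthetic values):

```python
import pandas as pd

# Synthetic stand-in for the Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025, 104.593]},
                    index=pd.Index([1, 2, 3], name='Obs'))

# Select a column, sort it, then take the top two rows, all in one chain
top_income = data['Income'].sort_values(ascending=False).head(2)
print(top_income)
```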
Suppose that we want to know the sample average credit card balance only for males. Below, we select the balance column and the rows such that the value of the gender column is male.
This is called boolean indexing in Python, because it involves the creation of binary variables indicating whether the condition is true or false for each row. The next cell will help you to understand this.
print(data['Gender'].head(2))
print(data['Gender'].head(2) == ' Male')
Obs
1      Male
2    Female
Name: Gender, dtype: object
Obs
1     True
2    False
Name: Gender, dtype: bool
You can also specify multiple conditions. The following selects males with age equal to or lower than 30. You can look at this reference for a list of Python comparison operators.
data.loc[(data['Gender']==' Male') & (data['Age']<=30),'Balance'].mean()
array([' Male', 'Female'], dtype=object)
We can use our data selection knowledge to fix this. Below, we jointly select the rows in which the gender is " Male" and the Gender column. We then replace the values in those locations with the correct label.
data.loc[data['Gender'] == ' Male', 'Gender'] = 'Male'
data.head(2)
Once you have made the necessary modifications to the dataset, you may want to save it to continue working on it later. More generally, you may wish to save the results of your analysis or export tables so that you can insert them in a report or webpage (after formatting). Pandas has methods to export data as CSV and Excel files, and as LaTeX and HTML tables, among other options.
In the following example I export our DataFrame as an Excel file. You can run the cell and try to open the file in Excel to check that it worked.
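A minimal sketch of such an export cell, using a synthetic frame and writing to a temporary directory so nothing on disk is overwritten; note that to_excel requires the openpyxl package to be installed, while to_csv has no extra dependency (Excel opens CSV files as well):

```python
import os
import tempfile
import pandas as pd

# Synthetic stand-in for the cleaned Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025], 'Age': [34, 82]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'credit_clean.csv')
data.to_csv(csv_path)   # CSV export
# data.to_excel(os.path.join(out_dir, 'credit_clean.xlsx'))  # needs openpyxl

print(os.path.exists(csv_path))
```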
After loading and preparing the data, we can start exploring it with basic descriptive statistics. The describe method provides a table with basic summary statistics for the data.
data.describe().round(1) # here I appended round to limit the number of decimal places in the display, try without it
There are also individual functions for a range of summary statistics. Refer to the pandas documentation for a full list of available functions. As examples, we calculate the means of the dataset and the correlation between income and credit card limit.
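A hedged sketch of these two computations on synthetic stand-in values (the real lesson uses the full Credit data, so the numbers will differ):

```python
import pandas as pd

# Synthetic stand-in for the Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025, 104.593, 148.924],
                     'Limit': [3606, 6645, 7075, 9504]})

print(data.mean())                          # mean of every numerical column
print(data['Income'].corr(data['Limit']))   # correlation between two columns
```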
Income         45.22
Limit        4735.60
Rating        354.94
Cards           2.96
Age            55.67
Education      13.45
Balance       520.02
dtype: float64
When preparing complex datasets for analysis, it is often useful to work with different types of variables (say, numerical or categorical) separately. You can use the dataframe info method to view the list of columns and their variable types, which pandas calls dtypes. The object dtype is the most general type: a column with this dtype will typically contain text data.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 1 to 400
Data columns (total 11 columns):
Income       400 non-null float64
Limit        400 non-null int64
Rating       400 non-null int64
Cards        400 non-null int64
Age          400 non-null int64
Education    400 non-null int64
Gender       400 non-null object
Student      400 non-null object
Married      400 non-null object
Ethnicity    400 non-null object
Balance      400 non-null int64
dtypes: float64(1), int64(6), object(4)
memory usage: 57.5+ KB
To select columns of a particular type, you can do as follows:
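One way to do this is the select_dtypes method; a minimal sketch on synthetic data (column names and values invented for illustration):

```python
import pandas as pd

# Synthetic stand-in for the Credit data (illustration only)
data = pd.DataFrame({'Income': [14.891, 106.025],
                     'Cards': [3, 2],
                     'Gender': [' Male', 'Female']})

numeric = data.select_dtypes(include='number')   # float64 and int64 columns
text = data.select_dtypes(include='object')      # object (text) columns
print(list(numeric.columns))
print(list(text.columns))
```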
This lesson covered the basic steps to get started with data wrangling in Python. Here are some other useful skills that we address in our units.
If you plan to work as a data scientist, you should aim to build the following additional skills beyond this unit (among others).