Python for Business Analytics

NumPy and Statistics


Exercise 1

Write a function called PIN_generator that randomly generates a 4-digit numeric PIN. Find the appropriate function to help you with this task in the NumPy documentation.
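A minimal sketch of one possible solution, assuming numpy.random.randint is the function found in the documentation; the PIN is returned as a string so that leading zeros are kept:

import numpy as np

def PIN_generator():
    # draw four digits between 0 and 9 and join them into a string
    digits = np.random.randint(0, 10, size=4)
    return ''.join(str(d) for d in digits)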

Exercise 2

Write a function called password_generator that randomly generates a password with eight characters.

(a) Create a string containing all lowercase letters.

(b) Create a new string containing all uppercase letters by using the dir() command or the internet to find a method to convert a string to uppercase letters.

(c) Create a string containing numbers and special characters.

(d) Merge the strings.

(e) Complete the exercise using the fact that we can do string indexing, as in the cell below; a possible sketch follows that cell.

In [2]:
word='Sydney'
word[3]
Out[2]:
'n'
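A possible sketch of the whole exercise, assuming numpy.random.randint for picking random positions; the set of special characters used in (c) is arbitrary:

import numpy as np

def password_generator():
    lower = 'abcdefghijklmnopqrstuvwxyz'    # (a) all lowercase letters
    upper = lower.upper()                   # (b) upper() found via dir(lower)
    other = '0123456789!@#$%&*?'            # (c) digits and special characters
    pool = lower + upper + other            # (d) merge the strings
    # (e) pick eight random positions and index into the merged string
    positions = np.random.randint(0, len(pool), size=8)
    return ''.join(pool[i] for i in positions)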

Exercise 3

Write a function that takes a NumPy array as an input and returns a $100\times(1-\alpha)\%$ confidence interval for the mean using a normal approximation to the sampling distribution. Recall that the formula for the approximate confidence interval is:


\begin{equation} \overline{X}\pm z_{\alpha/2}\frac{s_X}{\sqrt{N}} \end{equation}

You can test your program using simulated data generated as in the first cell below. The second cell loads the package that you need for calculating the critical value. It is part of the SciPy library, which builds on NumPy. Try to find the appropriate function in the SciPy documentation.

In [3]:
import numpy as np
mu, sigma=0, 5
y=np.random.normal(mu,sigma,100)
In [4]:
from scipy import stats
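A minimal sketch of such a function, here called normal_ci (the name is arbitrary), assuming scipy.stats.norm.ppf as the critical-value function:

import numpy as np
from scipy import stats

def normal_ci(data, alpha=0.05):
    # 100*(1-alpha)% confidence interval for the mean, normal approximation
    data = np.asarray(data)
    N = len(data)
    xbar = data.mean()
    se = data.std(ddof=1) / np.sqrt(N)      # s_X / sqrt(N)
    z = stats.norm.ppf(1 - alpha / 2)       # critical value z_{alpha/2}
    return xbar - z * se, xbar + z * se

normal_ci(y)                                # test on the simulated data above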

Exercise 4

In this exercise we will use NumPy to explore the concept of a confidence interval. Remember that a confidence interval is an interval estimator that has the statistical property that it contains the true parameter in $100\times(1-\alpha)\%$ of repeated samples.

(a) Specify the mean and variance of the population by assigning numerical values to the variables mu and sigma.

(b) Specify the sample size $N=100$ and the number of simulated samples $S=10000$.

(c) Create a 2-dimensional array containing $S$ independent samples of size $N$ from a normal population with mean $\mu$ and standard deviation $\sigma$. Your array should have $N$ rows and $S$ columns. The size option has format (rows,columns). Each column therefore represents a random sample of size $N$ from the population.

(d) Calculate a 95% confidence interval for each of your samples by using array computations. Compared to the previous exercise, you will have to specify the axis option in the mean and standard deviation functions.

(e) Calculate the proportion of the $S$ confidence intervals that contain the true parameter $\mu$. If your program is correct, it will be approximately 0.95. The larger $S$ is, the closer to $0.95$ it will be.

In practical settings, you can think of your observed data as one of these columns, with the number of columns going to infinity.
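A minimal sketch of steps (a) through (e), assuming scipy.stats.norm.ppf for the critical value; the chosen values and variable names are illustrative:

import numpy as np
from scipy import stats

mu, sigma = 0, 5            # (a) population mean and standard deviation
N, S = 100, 10000           # (b) sample size and number of samples
alpha = 0.05

# (c) N x S array: each column is one sample of size N
x = np.random.normal(mu, sigma, size=(N, S))

# (d) confidence interval for every column at once
xbar = x.mean(axis=0)
se = x.std(axis=0, ddof=1) / np.sqrt(N)
z = stats.norm.ppf(1 - alpha / 2)
lower, upper = xbar - z * se, xbar + z * se

# (e) proportion of intervals that contain the true mean
coverage = np.mean((lower <= mu) & (mu <= upper))
print(coverage)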

Exercise 5

Change the previous program to consider a Student's t population with three degrees of freedom. Alternatively, you may consider other distributions that you are familiar with, such as the exponential or uniform. You should check that when the number of observations $N$ is low, the confidence interval based on the normal approximation does not have the correct coverage. However, when $N$ gets larger the approximation becomes more accurate due to the Central Limit Theorem. What seems to be a large enough $N$ in this case?

The cell below allows you to visualise data generated from this model (don't worry about the details, we will cover visualisation in the next module).

In [5]:
y=np.random.standard_t(5,1000)

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(y)   # newer seaborn versions use sns.histplot(y, kde=True)
plt.show()
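One way to adapt the Exercise 4 program is to change only the sampling line; the snippet below is a sketch assuming the variables mu, N and S from that sketch:

# Student's t population with three degrees of freedom; the true mean is still mu,
# so the coverage calculation in step (e) is unchanged
x = mu + np.random.standard_t(3, size=(N, S))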

Exercise 6

Write a function ztest that calculates the p-value for a two-sided hypothesis test for the mean, where the null hypothesis is that the population mean is equal to an arbitrary value $\mu_0$. The test is based on a normal approximation to the sampling distribution of the sample mean. The function takes a sample (a NumPy array) and $\mu_0$ as inputs.
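A minimal sketch of one possible ztest, assuming scipy.stats.norm.cdf for the normal distribution function:

import numpy as np
from scipy import stats

def ztest(data, mu0):
    # two-sided p-value based on the normal approximation to the sample mean
    data = np.asarray(data)
    N = len(data)
    se = data.std(ddof=1) / np.sqrt(N)
    z = (data.mean() - mu0) / se                 # test statistic
    return 2 * (1 - stats.norm.cdf(abs(z)))      # two-sided p-value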

Exercise 7

As in Exercise 4, write a program that generates repeated samples of size $N$ from a normal population. Based on the samples and a significance level of $\alpha$, calculate the proportion of samples for which the test of Exercise 6 rejects the null hypothesis that the mean is equal to the true mean in the data generating process. Then, generate samples from a Student's t distribution with 3 degrees of freedom and a small sample size, and verify that the nominal significance level is no longer accurate.
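A sketch of the rejection-rate simulation, reusing the ztest sketch above; the values of mu, sigma, N and S are illustrative:

import numpy as np

mu, sigma = 0, 5
N, S, alpha = 100, 10000, 0.05

# normal population: the rejection rate should be close to alpha
x = np.random.normal(mu, sigma, size=(N, S))
pvalues = np.array([ztest(x[:, j], mu) for j in range(S)])
print(np.mean(pvalues < alpha))

# small samples from a t(3) population: the rejection rate typically drifts away from alpha
x = mu + np.random.standard_t(3, size=(10, S))
pvalues = np.array([ztest(x[:, j], mu) for j in range(S)])
print(np.mean(pvalues < alpha))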

Exercise 8 (challenging)

The cell below shows you how to fit a Student's t distribution to data (also allowing for the estimation of the mean and standard deviation).

Write a function that implements a simulation-based hypothesis test using the logic of Exercise 7, but using the parameters estimated from the data to generate the artificial samples.

In [6]:
df, mu, sigma=stats.t.fit(y)
print(df)
print(mu)
print(sigma)
5.46639437802
-0.0170504748445
0.996047668422
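A possible sketch of such a test, here called simulation_ztest (the name and the number of simulations are arbitrary); it imposes the null hypothesis by centring the fitted t distribution at $\mu_0$ and compares the observed statistic with its simulated distribution:

import numpy as np
from scipy import stats

def simulation_ztest(data, mu0, S=10000):
    # simulation-based two-sided test of H0: the population mean equals mu0
    data = np.asarray(data)
    N = len(data)

    # fit a t distribution to the observed data, as in the cell above
    df, loc, scale = stats.t.fit(data)

    # observed statistic, the same one as in the ztest of Exercise 6
    observed = abs((data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(N)))

    # artificial samples from the fitted t distribution with the null imposed:
    # keep the estimated df and scale, but centre the distribution at mu0
    # (for df > 1 the mean of a t distribution equals its location parameter)
    sims = stats.t.rvs(df, loc=mu0, scale=scale, size=(S, N))
    sim_stats = np.abs((sims.mean(axis=1) - mu0) /
                       (sims.std(axis=1, ddof=1) / np.sqrt(N)))

    # p-value: proportion of simulated statistics at least as extreme as the observed one
    return np.mean(sim_stats >= observed)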

Exercise 9 (challenging)

The next cells download data for the S&P 500 index from Yahoo Finance, construct the corresponding return series, and fit a Student's t distribution to it. Test the null hypothesis that the returns are zero by using the normal approximation method from BUSS1020/QBUS5001/QBUS5002 and the method of Exercise 8. Are the p-values similar?

In [7]:
from pandas_datareader import data as web  # pandas.io.data was removed from pandas; requires the pandas-datareader package
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
data = web.DataReader('SPY', 'yahoo', start, end)
In [8]:
data.head()
Out[8]:
Open High Low Close Volume Adj Close
Date
2010-01-04 112.370003 113.389999 111.510002 113.330002 118944600 99.292299
2010-01-05 113.260002 113.680000 112.849998 113.629997 111579900 99.555135
2010-01-06 113.519997 113.989998 113.430000 113.709999 116074400 99.625228
2010-01-07 113.500000 114.330002 113.180000 114.190002 131091100 100.045775
2010-01-08 113.889999 114.620003 113.660004 114.570000 126402800 100.378704
In [9]:
returns=(np.log(data['Adj Close'])-np.log(data['Adj Close'].shift(1))).to_numpy() # as_matrix() was removed from pandas
returns=returns[1:] # removes the missing value generated by creating the returns
In [10]:
df, mu, sigma=stats.t.fit(returns)
print(df)
print(mu)
print(sigma)
2.72821796669
0.000878429335466
0.0071277385863
In [11]:
sns.distplot(returns)   # or sns.histplot(returns, kde=True) in newer seaborn
plt.show()
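Assuming the ztest and simulation_ztest sketches above, the comparison could look like this:

print(ztest(returns, 0))              # normal-approximation p-value
print(simulation_ztest(returns, 0))   # simulation-based p-value from the fitted t distribution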