Python for Data Analysis

Module 1 solutions (NumPy and statistics)


Exercise 1

In [1]:
import numpy as np
def pin_generator():
    pin=str(np.random.randint(0,10000)) # the upper bound is exclusive, so this covers 0-9999
    return '0'*(4-len(pin))+pin
In [2]:
y=pin_generator()
print(y)
2851
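As an aside (not part of the original solution), the manual zero-padding can also be written with the built-in `str.zfill`, which pads a string with leading zeros to a given width:

```python
import numpy as np

def pin_generator():
    # str.zfill(4) pads with leading zeros, matching the
    # '0'*(4-len(pin))+pin construction above
    return str(np.random.randint(0, 10000)).zfill(4)

print(pin_generator())
```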

Exercise 2

This solution is more complete than the exercise requires, as it accepts the password size as an argument.

In [3]:
def password_generator(size=6):
    letters='abcdefghijklmnopqrstuvwxyz'
    numbers='1234567890'
    special='!@#$%^&*'
    combined=2*letters+letters.upper()+numbers+special #the 2x makes lower case letters more frequent
    index=np.random.randint(0,len(combined),size) # the upper bound is exclusive, so every character can be drawn
    return  ''.join(np.array(list(combined))[index])

# list(combined): creates a list where each element is a character from combined
# np.array(): converts the list into an array that can be efficiently indexed
# [index]: selects the indexes on the array corresponding to the randomly generated numbers
# ''.join(): concatenates the selected characters into a single string with no separator
In [4]:
print(password_generator(8))
azrVopPB
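An alternative sketch (not required by the exercise): `np.random.choice` samples the characters directly, which replaces the manual randint-then-index construction with a single call.

```python
import numpy as np

def password_generator(size=6):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    numbers = '1234567890'
    special = '!@#$%^&*'
    combined = 2*letters + letters.upper() + numbers + special
    # np.random.choice draws `size` characters (with replacement)
    # from the pool, so no explicit index array is needed
    return ''.join(np.random.choice(list(combined), size))

print(password_generator(8))
```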

Exercise 3

In [5]:
from scipy import stats

def confidence_interval(x, alpha=0.05):
    xbar=np.mean(x)
    se=np.std(x, ddof=1)/np.sqrt(len(x))
    crit=stats.norm.ppf(1-alpha/2)
    return xbar-crit*se, xbar+crit*se 
In [6]:
x=np.random.normal(5,1,100)
print(confidence_interval(x))
(4.8205651000489569, 5.2256726368749709)
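As a cross-check (not part of the original solution), `scipy.stats.norm.interval` returns the central interval for a normal distribution with a given location and scale, so it should reproduce the hand-computed endpoints exactly:

```python
import numpy as np
from scipy import stats

def confidence_interval(x, alpha=0.05):
    xbar = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(len(x))
    crit = stats.norm.ppf(1 - alpha/2)
    return xbar - crit*se, xbar + crit*se

x = np.random.normal(5, 1, 100)
lo, hi = confidence_interval(x)
# norm.interval takes the confidence level (0.95) plus loc/scale
lo2, hi2 = stats.norm.interval(0.95, loc=np.mean(x),
                               scale=np.std(x, ddof=1)/np.sqrt(len(x)))
print(np.isclose(lo, lo2), np.isclose(hi, hi2))
```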

Exercise 4

Here, I will use the critical value from the Student's t distribution so that we get an exact confidence interval for the simulated data.

In [7]:
alpha, mu, sigma, N, S = 0.05, 0, 1, 100, 100000 
X=np.random.normal(mu,sigma,(N,S))

crit=stats.t.ppf(1-alpha/2, N-1)
xbar=np.mean(X, axis=0)
se=np.std(X, axis=0, ddof=1)/np.sqrt(N)

in_CI=np.sum((mu>(xbar-crit*se))&(mu<(xbar+crit*se)))
print(in_CI/S)
0.94936
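The exact coverage comes from using the Student's t critical value. For N=100 it is only slightly larger than the normal one, which is why a normal-based interval would already be close; a quick comparison (an illustration, not part of the original solution):

```python
from scipy import stats

# Two-sided 5% critical values for N = 100 (df = 99)
crit_t = stats.t.ppf(0.975, 99)   # exact when sigma is estimated
crit_z = stats.norm.ppf(0.975)    # asymptotic approximation
print(crit_t, crit_z)             # roughly 1.984 vs 1.960
```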

Let's plot the simulated sampling distribution of the sample mean. Since the number of degrees of freedom is high, the distribution is very nearly Gaussian.

In [8]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
In [9]:
sns.histplot(xbar, kde=True) # distplot and sns.plt were removed in recent seaborn versions
plt.show()

Exercise 5

In this case the confidence interval is slightly too conservative.

In [10]:
mu=0
alpha, N, S = 0.05, 100, 100000
X=np.random.standard_t(3,(N,S))

crit=stats.t.ppf(1-alpha/2, N-1)
xbar=np.mean(X, axis=0)
se=np.std(X, axis=0, ddof=1)/np.sqrt(N)

in_CI=np.sum((mu>(xbar-crit*se))&(mu<(xbar+crit*se)))
print(in_CI/S)
0.95304

Even though the sample size is N=100, the simulated sampling distribution is fat-tailed and therefore far from Gaussian:

In [11]:
sns.histplot(xbar, kde=True) # distplot and sns.plt were removed in recent seaborn versions
plt.show()

Exercise 6

In [12]:
def ztest(x, mu0):
    xbar=np.mean(x)
    se=np.std(x, ddof=1)/np.sqrt(len(x))
    z=np.abs((xbar-mu0)/se)
    return 2*(1-stats.norm.cdf(z))
In [13]:
x=np.random.normal(0,1,100)
print(ztest(x,0))

x=np.random.normal(0.5,1,100)
print(ztest(x,0))
0.959853200012
5.93414079995e-05
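As a sanity check (not part of the original solution), the z-test can be compared against `scipy.stats.ttest_1samp`, which uses the same statistic but Student's t tail probabilities; for N=100 the two p-values nearly coincide:

```python
import numpy as np
from scipy import stats

def ztest(x, mu0):
    xbar = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(len(x))
    z = np.abs((xbar - mu0) / se)
    return 2 * (1 - stats.norm.cdf(z))

np.random.seed(0)
x = np.random.normal(0, 1, 100)
t_stat, p_t = stats.ttest_1samp(x, 0)  # same statistic, t(99) tails
print(ztest(x, 0), p_t)
```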

Exercise 7

In [14]:
alpha, mu, sigma, N, S = 0.01, 0, 1, 10, 100000 
X=np.random.normal(mu,sigma,(N,S))

xbar=np.mean(X, axis=0)
se=np.std(X, axis=0, ddof=1)/np.sqrt(N)
t=np.abs((xbar-mu)/se)
pvals=2*(1-stats.t.cdf(t, N-1))

print(np.sum(pvals<alpha)/S)
0.0101
In [15]:
alpha, mu, N, S = 0.01, 0, 10, 100000 
X=np.random.standard_t(3,(N,S))

xbar=np.mean(X, axis=0)
se=np.std(X, axis=0, ddof=1)/np.sqrt(N)
z=np.abs((xbar-mu)/se)
pvals=2*(1-stats.norm.cdf(z))

print(np.sum(pvals<alpha)/S)
0.02202

In this case, using an inappropriate test makes the actual significance level more than twice as high as the nominal significance level.
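Part of this distortion is attributable to the critical value alone: even with Gaussian data, the t-statistic at N=10 has t(9) tails, so evaluating it against normal tail probabilities inflates the size of a nominal 1% test. A back-of-the-envelope calculation (an illustration, not part of the original solution):

```python
from scipy import stats

# Threshold actually used by the normal-based test at alpha = 0.01
z_crit = stats.norm.ppf(0.995)            # about 2.576
# True rejection probability when the statistic follows t(9)
actual_size = 2 * stats.t.sf(z_crit, 9)
print(actual_size)                        # roughly 0.03, not 0.01
```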

Exercise 8

In [16]:
def par_bootstrap_test(x, mu0, alpha=0.05, S=10000):
    N=len(x)
    df, _, sigma=stats.t.fit(x) # the second output is the mean estimate, which we ignore
    X=stats.t.rvs(df, mu0, sigma, size=(N,S)) # simulate under the null by setting the location to mu0 and keeping the fitted scale
    means=np.mean(X,axis=0)
    xbar=np.mean(x, axis=0)
    return np.sum(np.abs(means-mu0)>np.abs(xbar-mu0))/S
In [22]:
x=np.random.standard_t(3,10)
print(par_bootstrap_test(x,0))

x=np.random.standard_t(3,10)
print(par_bootstrap_test(x,1))
0.4999
0.0003
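A nonparametric variant (a sketch, not part of the original solution; the helper name `nonpar_bootstrap_test` is mine) avoids fitting a t distribution altogether: it resamples the data after recentering it at mu0, so the null hypothesis holds in the bootstrap world by construction.

```python
import numpy as np

def nonpar_bootstrap_test(x, mu0, S=10000):
    N = len(x)
    # Recenter the data at mu0 so the null is true for the
    # resampled datasets; no distributional assumption needed
    x0 = x - np.mean(x) + mu0
    X = np.random.choice(x0, size=(N, S), replace=True)
    means = np.mean(X, axis=0)
    return np.sum(np.abs(means - mu0) > np.abs(np.mean(x) - mu0)) / S

np.random.seed(1)
x = np.random.standard_t(3, 10)
print(nonpar_bootstrap_test(x, 0))
```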

Exercise 9

In [23]:
from pandas_datareader import data as web # pandas.io.data was moved to the separate pandas-datareader package
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
data = web.DataReader('SPY', 'yahoo', start, end)
returns=(np.log(data['Adj Close'])-np.log(data['Adj Close'].shift(1))).to_numpy() # as_matrix() was removed in recent pandas versions
returns=returns[1:] # removes the missing value generated by creating the returns
In [24]:
print(par_bootstrap_test(returns,0))
0.3496
In [25]:
print(ztest(returns,0))
0.281048957545