Introduction to the Research Environment

The research environment is powered by IPython notebooks, which allow one to perform a great deal of data analysis and statistical validation. We'll demonstrate a few simple techniques here.

Code Cells vs. Text Cells

As you can see, each cell can be either code or text. To select between them, choose from the 'Cell Type' dropdown menu on the top left.

Executing a Command

A code cell will be evaluated when you press play, or when you press the shortcut, shift-enter. Evaluating a cell evaluates each line of code in sequence, and prints the results of the last line below the cell.

In [1]:
2 + 2
Out[1]:
4

Sometimes there is no result to be printed, as is the case with assignment.

In [2]:
X = 2

Remember that only the result from the last line is printed.

In [3]:
2 + 2
3 + 3
Out[3]:
6

However, you can print whichever lines you want using print.

In [4]:
print(2 + 2)
3 + 3
4
Out[4]:
6

Knowing When a Cell is Running

While a cell is running, [*] is displayed on the left. When a cell has yet to be executed, [ ] is displayed. Once it has been run, a number is displayed indicating the order in which it was run during the execution of the notebook, for example [5]. Try running the cell below and watch the indicator change.

In [5]:
#Take some time to run something
c = 0
for i in range(10000000):
    c = c + i
c
Out[5]:
49999995000000

Importing Libraries

The vast majority of the time, you'll want to use functions from pre-built libraries. You can't import every library on Quantopian for security reasons, but you can import most of the common scientific ones. Here I import numpy and pandas, the two most common and useful libraries in quant finance. I recommend copying this import statement to every new notebook.

Notice that you can rename libraries to whatever you want when importing them. The as keyword allows this. Here we use np and pd as aliases for numpy and pandas. This aliasing is very common and will be found in most code snippets around the web. The point is to let you type fewer characters when you access these libraries frequently.

In [6]:
import numpy as np
import pandas as pd

# This is a plotting library for pretty pictures.
import matplotlib.pyplot as plt
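
As a small illustration of what an alias does, the sketch below (assuming only that numpy is installed) shows that np is just another name bound to the same module object, so either name reaches the same functions.

```python
import numpy
import numpy as np

# The alias and the full name refer to the same module object.
same_module = np is numpy

# Either name can therefore be used to call the same function.
root = np.sqrt(16.0)

print(same_module, root)
```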

Tab Autocomplete

Pressing tab will give you a list of IPython's best guesses for what you might want to type next. This is incredibly valuable and will save you a lot of time. If there is only one possible option for what you could type next, IPython will fill that in for you. Try pressing tab often: it will seldom fill in anything you don't want, because a list is shown whenever there is ambiguity. This is a great way to see what functions are available in a library.

Try placing your cursor after the . and pressing tab.

In [7]:
np.random.beta
Out[7]:
<function beta>
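
Outside the notebook, a roughly similar list of names can be obtained with Python's built-in dir function. The sketch below is only an approximation of what tab completion shows; the prefix "be" is an arbitrary choice for illustration.

```python
import numpy as np

# dir() lists the attribute names of an object, much like the completion list.
# Filter on a prefix, as if "np.random.be" had been typed before pressing tab.
candidates = [name for name in dir(np.random) if name.startswith("be")]

print(candidates)
```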

Getting Documentation Help

Placing a question mark after a function and executing that line of code will give you the documentation IPython has for that function. It's often best to do this in a new cell, as you avoid re-executing other code and running into bugs.

In [8]:
np.random.normal?
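
The ? syntax is specific to IPython. In plain Python, much the same text is available through the built-in help function or the function's __doc__ attribute. A minimal sketch, printing only the first few lines of the docstring:

```python
import numpy as np

# The documentation text is stored on the function itself.
doc = np.random.normal.__doc__

# Show only the first five lines to keep the output short.
print("\n".join(doc.strip().splitlines()[:5]))
```

Calling help(np.random.normal) prints the full text instead.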

Sampling

We'll sample some random data using a function from numpy.

In [9]:
# Sample 100 points with a mean of 0 and an std of 1. This is a standard normal distribution.
X = np.random.normal(0, 1, 100)
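
Because the sample is random, the numbers change every time the cell is run. If repeatable results are needed, one option is to seed numpy's random number generator first. The sketch below assumes an arbitrary seed value of 42, chosen only for illustration.

```python
import numpy as np

# Assumption: the seed value 42 is arbitrary.
np.random.seed(42)
first_draw = np.random.normal(0, 1, 100)

# Re-seeding with the same value reproduces exactly the same sample.
np.random.seed(42)
second_draw = np.random.normal(0, 1, 100)

print(np.array_equal(first_draw, second_draw))
```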

Plotting

We can use the plotting library we imported as follows.

In [10]:
plt.plot(X)
Out[10]:
[<matplotlib.lines.Line2D at 0x7f442f459dd0>]

Squelching Line Output

You might have noticed the annoying line of the form [<matplotlib.lines.Line2D at 0x7f72fdbc1710>] before the plots. This is because the .plot function returns a value, and the notebook displays the value of the last line in a cell. When we don't want that output displayed, we can suppress it by ending the line with a semicolon, as follows.

In [11]:
plt.plot(X);
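
To see where the extra line comes from, it can help to capture what plt.plot returns. The sketch below assumes the non-interactive "Agg" backend so that it also runs outside a notebook; the variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

values = np.random.normal(0, 1, 100)

# plt.plot returns a list of Line2D objects; a notebook echoes that list
# when the call is the last expression in a cell.
lines = plt.plot(values)

# Assigning the return value to a name leaves nothing for the notebook to echo.
print(len(lines), type(lines[0]).__name__)
```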

Adding Axis Labels

No self-respecting quant leaves a graph without labeled axes. Here are some commands to help with that.

In [12]:
X = np.random.normal(0, 1, 100)
X2 = np.random.normal(0, 1, 100)

plt.plot(X);
plt.plot(X2);
plt.xlabel('Time') # The data we generated is unitless, but don't forget units in general.
plt.ylabel('Returns')
plt.legend(['X', 'X2']);

Generating Statistics

Let's use numpy to take some simple statistics.

In [13]:
np.mean(X)
Out[13]:
-0.026898970513545093
In [14]:
np.std(X)
Out[14]:
0.99233783955549493
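
One detail worth knowing: by default np.std computes the population standard deviation (it divides by N). Passing ddof=1 gives the sample standard deviation (it divides by N - 1). A small sketch with made-up numbers:

```python
import numpy as np

# Made-up observations with mean 5 and squared deviations summing to 32.
observations = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(observations)        # ddof=0 (the default): divide by N
sample_std = np.std(observations, ddof=1)    # ddof=1: divide by N - 1

print(population_std, sample_std)
```

With only a handful of points the two differ noticeably; with hundreds of observations the difference is small.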

Getting Real Pricing Data

Randomly sampled data can be great for testing ideas, but let's get some real data. We can use get_pricing to do that. You can use the ? syntax as discussed above to get more information on get_pricing's arguments.

In [15]:
data = get_pricing('MSFT', start_date='2012-1-1', end_date='2015-6-1')
In [16]:
data
Out[16]:
                           open_price    high     low  close_price       volume   price
2012-01-03 00:00:00+00:00      24.065  24.436  23.920       24.319   60891291.0  24.319
2012-01-04 00:00:00+00:00      24.309  24.899  24.273       24.826   76534029.0  24.826
2012-01-05 00:00:00+00:00      24.817  25.133  24.736       25.080   53479335.0  25.080
2012-01-06 00:00:00+00:00      24.953  25.551  24.949       25.488   91671771.0  25.488
2012-01-09 00:00:00+00:00      25.424  25.470  25.125       25.143   56352965.0  25.143
2012-01-10 00:00:00+00:00      25.316  25.515  25.152       25.234   54223945.0  25.234
2012-01-11 00:00:00+00:00      24.862  25.361  24.808       25.125   62855941.0  25.125
2012-01-12 00:00:00+00:00      25.261  25.397  25.057       25.379   46186121.0  25.379
2012-01-13 00:00:00+00:00      25.316  25.606  25.189       25.606   55326851.0  25.606
2012-01-17 00:00:00+00:00      25.742  25.968  25.533       25.615   66537766.0  25.615
2012-01-18 00:00:00+00:00      25.660  25.742  25.352       25.588   60912302.0  25.588
2012-01-19 00:00:00+00:00      25.524  25.773  25.406       25.497   66169544.0  25.497
2012-01-20 00:00:00+00:00      26.122  26.956  26.059       26.929  157989713.0  26.929
2012-01-23 00:00:00+00:00      26.784  27.147  26.603       26.947   70185739.0  26.947
2012-01-24 00:00:00+00:00      26.711  26.802  26.449       26.594   48606276.0  26.594
2012-01-25 00:00:00+00:00      26.349  26.875  26.349       26.802   55871304.0  26.802
2012-01-26 00:00:00+00:00      26.838  26.920  26.648       26.748   46229333.0  26.748
2012-01-27 00:00:00+00:00      26.693  26.766  26.440       26.485   41452599.0  26.485
2012-01-30 00:00:00+00:00      26.258  26.847  26.131       26.838   46520829.0  26.838
2012-01-31 00:00:00+00:00      26.884  26.920  26.494       26.775   40812893.0  26.775
2012-02-01 00:00:00+00:00      27.002  27.237  26.974       27.092   64067228.0  27.092
2012-02-02 00:00:00+00:00      27.101  27.346  26.929       27.142   49959760.0  27.142
2012-02-03 00:00:00+00:00      27.319  27.554  27.273       27.391   38728694.0  27.391
2012-02-06 00:00:00+00:00      27.228  27.391  27.165       27.373   26067704.0  27.373
2012-02-07 00:00:00+00:00      27.328  27.631  27.237       27.518   36751791.0  27.518
2012-02-08 00:00:00+00:00      27.428  27.799  27.391       27.790   47382414.0  27.790
2012-02-09 00:00:00+00:00      27.808  27.917  27.627       27.890   45580004.0  27.890
2012-02-10 00:00:00+00:00      27.772  27.917  27.518       27.622   38526519.0  27.622
2012-02-13 00:00:00+00:00      27.763  27.890  27.582       27.699   31447419.0  27.699
2012-02-14 00:00:00+00:00      27.672  27.791  27.234       27.599   50770567.0  27.599
...                               ...     ...     ...          ...          ...     ...
2015-04-20 00:00:00+00:00      41.461  42.891  41.411       42.623   38568282.0  42.623
2015-04-21 00:00:00+00:00      42.722  42.871  42.255       42.365   22952534.0  42.365
2015-04-22 00:00:00+00:00      42.394  42.852  42.275       42.702   21182014.0  42.702
2015-04-23 00:00:00+00:00      42.613  43.328  42.524       43.070   37178490.0  43.070
2015-04-24 00:00:00+00:00      45.365  47.829  45.355       47.561  114396800.0  47.561
2015-04-27 00:00:00+00:00      46.925  47.819  46.915       47.720   52844498.0  47.720
2015-04-28 00:00:00+00:00      47.471  48.892  47.392       48.843   53637731.0  48.843
2015-04-29 00:00:00+00:00      48.405  48.992  48.187       48.733   43326023.0  48.733
2015-04-30 00:00:00+00:00      48.386  49.220  48.286       48.346   56865088.0  48.346
2015-05-01 00:00:00+00:00      48.266  48.559  48.087       48.336   32389494.0  48.336
2015-05-04 00:00:00+00:00      48.058  48.554  47.869       47.929   30328340.0  47.929
2015-05-05 00:00:00+00:00      47.511  47.849  47.005       47.303   45944460.0  47.303
2015-05-06 00:00:00+00:00      47.263  47.462  45.723       45.991   47535797.0  45.991
2015-05-07 00:00:00+00:00      45.971  46.781  45.862       46.398   27047479.0  46.398
2015-05-08 00:00:00+00:00      47.243  47.670  47.213       47.412   27844465.0  47.412
2015-05-11 00:00:00+00:00      47.243  47.601  47.064       47.064   17346165.0  47.064
2015-05-12 00:00:00+00:00      46.547  47.372  46.120       47.054   24031639.0  47.054
2015-05-13 00:00:00+00:00      47.879  48.008  47.263       47.322   28548931.0  47.322
2015-05-14 00:00:00+00:00      47.720  48.505  47.720       48.405   26847365.0  48.405
2015-05-15 00:00:00+00:00      48.554  48.589  47.740       47.988   23411783.0  47.988
2015-05-18 00:00:00+00:00      47.670  47.909  47.303       47.700   20390841.0  47.700
2015-05-19 00:00:00+00:00      47.560  47.810  47.180       47.580   22462876.0  47.580
2015-05-20 00:00:00+00:00      47.390  47.930  47.270       47.580   18413349.0  47.580
2015-05-21 00:00:00+00:00      47.280  47.600  47.005       47.420   18679997.0  47.420
2015-05-22 00:00:00+00:00      47.300  47.350  46.820       46.900   21500514.0  46.900
2015-05-26 00:00:00+00:00      46.830  46.880  46.190       46.600   25219693.0  46.600
2015-05-27 00:00:00+00:00      46.820  47.770  46.620       47.620   22291914.0  47.620
2015-05-28 00:00:00+00:00      47.500  48.020  47.390       47.450   16527053.0  47.450
2015-05-29 00:00:00+00:00      47.430  47.570  46.590       46.860   25462536.0  46.860
2015-06-01 00:00:00+00:00      47.060  47.770  46.620       47.240   24322867.0  47.240

857 rows × 6 columns

Our data is now a pandas DataFrame. You can see the datetime index and the columns with different pricing data.

Because this is a pandas DataFrame, we can index into it to get just the price like this. For more information on pandas, see the pandas documentation.

In [17]:
X = data['price']
In [18]:
X
Out[18]:
2012-01-03 00:00:00+00:00    24.319
2012-01-04 00:00:00+00:00    24.826
2012-01-05 00:00:00+00:00    25.080
2012-01-06 00:00:00+00:00    25.488
2012-01-09 00:00:00+00:00    25.143
2012-01-10 00:00:00+00:00    25.234
2012-01-11 00:00:00+00:00    25.125
2012-01-12 00:00:00+00:00    25.379
2012-01-13 00:00:00+00:00    25.606
2012-01-17 00:00:00+00:00    25.615
2012-01-18 00:00:00+00:00    25.588
2012-01-19 00:00:00+00:00    25.497
2012-01-20 00:00:00+00:00    26.929
2012-01-23 00:00:00+00:00    26.947
2012-01-24 00:00:00+00:00    26.594
2012-01-25 00:00:00+00:00    26.802
2012-01-26 00:00:00+00:00    26.748
2012-01-27 00:00:00+00:00    26.485
2012-01-30 00:00:00+00:00    26.838
2012-01-31 00:00:00+00:00    26.775
2012-02-01 00:00:00+00:00    27.092
2012-02-02 00:00:00+00:00    27.142
2012-02-03 00:00:00+00:00    27.391
2012-02-06 00:00:00+00:00    27.373
2012-02-07 00:00:00+00:00    27.518
2012-02-08 00:00:00+00:00    27.790
2012-02-09 00:00:00+00:00    27.890
2012-02-10 00:00:00+00:00    27.622
2012-02-13 00:00:00+00:00    27.699
2012-02-14 00:00:00+00:00    27.599
                              ...  
2015-04-20 00:00:00+00:00    42.623
2015-04-21 00:00:00+00:00    42.365
2015-04-22 00:00:00+00:00    42.702
2015-04-23 00:00:00+00:00    43.070
2015-04-24 00:00:00+00:00    47.561
2015-04-27 00:00:00+00:00    47.720
2015-04-28 00:00:00+00:00    48.843
2015-04-29 00:00:00+00:00    48.733
2015-04-30 00:00:00+00:00    48.346
2015-05-01 00:00:00+00:00    48.336
2015-05-04 00:00:00+00:00    47.929
2015-05-05 00:00:00+00:00    47.303
2015-05-06 00:00:00+00:00    45.991
2015-05-07 00:00:00+00:00    46.398
2015-05-08 00:00:00+00:00    47.412
2015-05-11 00:00:00+00:00    47.064
2015-05-12 00:00:00+00:00    47.054
2015-05-13 00:00:00+00:00    47.322
2015-05-14 00:00:00+00:00    48.405
2015-05-15 00:00:00+00:00    47.988
2015-05-18 00:00:00+00:00    47.700
2015-05-19 00:00:00+00:00    47.580
2015-05-20 00:00:00+00:00    47.580
2015-05-21 00:00:00+00:00    47.420
2015-05-22 00:00:00+00:00    46.900
2015-05-26 00:00:00+00:00    46.600
2015-05-27 00:00:00+00:00    47.620
2015-05-28 00:00:00+00:00    47.450
2015-05-29 00:00:00+00:00    46.860
2015-06-01 00:00:00+00:00    47.240
Freq: C, Name: price, dtype: float64
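
get_pricing is only available inside the Quantopian research environment. To experiment with column selection elsewhere, a small stand-in DataFrame can be built by hand. The sketch below uses the first three rows and two of the columns from the output above; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the get_pricing result: the first three rows
# and two of the columns from the output shown above.
dates = pd.to_datetime(["2012-01-03", "2012-01-04", "2012-01-05"], utc=True)
sample = pd.DataFrame(
    {"close_price": [24.319, 24.826, 25.080],
     "price": [24.319, 24.826, 25.080]},
    index=dates,
)

prices = sample["price"]       # selecting one column yields a Series
first_price = prices.iloc[0]   # positional access to the first value

print(type(prices).__name__, len(prices))
```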

Because there is now also date information in our data, we provide two series to .plot. X.index gives us the datetime index, and X.values gives us the pricing values. These are used as the X and Y coordinates to make a graph.

In [19]:
plt.plot(X.index, X.values)
plt.ylabel('Price')
plt.legend(['MSFT']);

We can get statistics again on real data.

In [20]:
np.mean(X)
Out[20]:
34.49160093348889
In [21]:
np.std(X)
Out[21]:
7.309055602383863

Getting Returns from Prices

We can use the pct_change method to get returns. Notice how we drop the first element after doing this, as it will be NaN (the first price has no previous price to compare against, so its percent change is undefined).

In [22]:
R = X.pct_change()[1:]
In [23]:
R
Out[23]:
2012-01-04 00:00:00+00:00    0.020848
2012-01-05 00:00:00+00:00    0.010231
2012-01-06 00:00:00+00:00    0.016268
2012-01-09 00:00:00+00:00   -0.013536
2012-01-10 00:00:00+00:00    0.003619
2012-01-11 00:00:00+00:00   -0.004320
2012-01-12 00:00:00+00:00    0.010109
2012-01-13 00:00:00+00:00    0.008944
2012-01-17 00:00:00+00:00    0.000351
2012-01-18 00:00:00+00:00   -0.001054
2012-01-19 00:00:00+00:00   -0.003556
2012-01-20 00:00:00+00:00    0.056163
2012-01-23 00:00:00+00:00    0.000668
2012-01-24 00:00:00+00:00   -0.013100
2012-01-25 00:00:00+00:00    0.007821
2012-01-26 00:00:00+00:00   -0.002015
2012-01-27 00:00:00+00:00   -0.009833
2012-01-30 00:00:00+00:00    0.013328
2012-01-31 00:00:00+00:00   -0.002347
2012-02-01 00:00:00+00:00    0.011839
2012-02-02 00:00:00+00:00    0.001846
2012-02-03 00:00:00+00:00    0.009174
2012-02-06 00:00:00+00:00   -0.000657
2012-02-07 00:00:00+00:00    0.005297
2012-02-08 00:00:00+00:00    0.009884
2012-02-09 00:00:00+00:00    0.003598
2012-02-10 00:00:00+00:00   -0.009609
2012-02-13 00:00:00+00:00    0.002788
2012-02-14 00:00:00+00:00   -0.003610
2012-02-15 00:00:00+00:00   -0.006594
                               ...   
2015-04-20 00:00:00+00:00    0.030761
2015-04-21 00:00:00+00:00   -0.006053
2015-04-22 00:00:00+00:00    0.007955
2015-04-23 00:00:00+00:00    0.008618
2015-04-24 00:00:00+00:00    0.104272
2015-04-27 00:00:00+00:00    0.003343
2015-04-28 00:00:00+00:00    0.023533
2015-04-29 00:00:00+00:00   -0.002252
2015-04-30 00:00:00+00:00   -0.007941
2015-05-01 00:00:00+00:00   -0.000207
2015-05-04 00:00:00+00:00   -0.008420
2015-05-05 00:00:00+00:00   -0.013061
2015-05-06 00:00:00+00:00   -0.027736
2015-05-07 00:00:00+00:00    0.008850
2015-05-08 00:00:00+00:00    0.021854
2015-05-11 00:00:00+00:00   -0.007340
2015-05-12 00:00:00+00:00   -0.000212
2015-05-13 00:00:00+00:00    0.005696
2015-05-14 00:00:00+00:00    0.022886
2015-05-15 00:00:00+00:00   -0.008615
2015-05-18 00:00:00+00:00   -0.006002
2015-05-19 00:00:00+00:00   -0.002516
2015-05-20 00:00:00+00:00    0.000000
2015-05-21 00:00:00+00:00   -0.003363
2015-05-22 00:00:00+00:00   -0.010966
2015-05-26 00:00:00+00:00   -0.006397
2015-05-27 00:00:00+00:00    0.021888
2015-05-28 00:00:00+00:00   -0.003570
2015-05-29 00:00:00+00:00   -0.012434
2015-06-01 00:00:00+00:00    0.008109
Freq: C, Name: price, dtype: float64
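
To make the calculation behind pct_change concrete, the sketch below applies it to the first three MSFT prices shown earlier and compares the result with the formula written out by hand, price today divided by price yesterday, minus one.

```python
import numpy as np
import pandas as pd

# The first three MSFT prices from the output above.
prices = pd.Series([24.319, 24.826, 25.080])

returns = prices.pct_change()            # first entry is NaN
manual = prices / prices.shift(1) - 1    # the same calculation written out

print(returns.iloc[1:].round(6).tolist())
```

The two non-NaN values round to 0.020848 and 0.010231, matching the first two rows of the returns series.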

We can plot the returns distribution as a histogram.

In [24]:
plt.hist(R, bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['MSFT Returns']);

Get statistics again.

In [25]:
np.mean(R)
Out[25]:
0.000879089143363588
In [26]:
np.std(R)
Out[26]:
0.014347860964324364

Now let's go backwards and generate data out of a normal distribution using the statistics we estimated from Microsoft's returns. We'll see that we have good reason to suspect Microsoft's returns may not be normal, as the resulting normal distribution looks quite different from the histogram of actual returns.

In [27]:
plt.hist(np.random.normal(np.mean(R), np.std(R), 10000), bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['Normally Distributed Returns']);
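
Comparing histograms by eye is informal. One rough numerical check is excess kurtosis, which is near 0 for normally distributed data and larger for heavy-tailed data. The sketch below uses synthetic samples only, not the MSFT returns; the seed, the sample size, and the Student-t distribution with 5 degrees of freedom are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic data only; seed and distributions are illustrative.
rng = np.random.RandomState(0)
normal_sample = pd.Series(rng.normal(0, 1, 100000))
heavy_tailed_sample = pd.Series(rng.standard_t(5, size=100000))

# Series.kurt() reports excess kurtosis, which is close to 0 for normal data.
normal_kurt = normal_sample.kurt()
heavy_kurt = heavy_tailed_sample.kurt()

print(normal_kurt, heavy_kurt)
```

Applying the same call to a returns series, for example R.kurt(), would show how far that series departs from the normal benchmark.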

Generating a Moving Average

pandas has some nice tools to allow us to generate rolling statistics. Here's an example. Notice how there's no moving average for the first 59 days, as a 60-day window needs 60 days of data before it can produce a value.

In [28]:
# Take the average of the last 60 days at each timepoint.
MAVG = X.rolling(window=60).mean()
plt.plot(X.index, X.values)
plt.plot(MAVG.index, MAVG.values)
plt.ylabel('Price')
plt.legend(['MSFT', '60-day MAVG']);
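
The behaviour at the start of a rolling window is easier to see on a tiny series. The sketch below uses made-up values and a window of 3, so the first two entries (window - 1 of them) have too little history and come back as NaN.

```python
import numpy as np
import pandas as pd

# Made-up values, chosen so the averages are easy to check by hand.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# With window=3 the first two entries have too little history and are NaN.
rolling_avg = values.rolling(window=3).mean()

print(rolling_avg.tolist())
```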
  

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.