Benson // w1d2

Winter 2015
01/13/2015

Planned schedule and activities

9:00 am: Good morning!

9:15 am: Introduction to Project Benson

9:30 am: Python I/O (Input/Output)

10:00 am: Challenges

12:00 pm: Lunchy lunch

1:30 pm: Challenge problem continuation

3:30 pm: MTA data exploration as a class

5:00 pm: Lillian and Megan talk about careers

Foreshadowing for tomorrow:
* Groups of 3-4 people
* For each group, think about what kind of client you're going to have. Draft your own client email (like the WomenTechWomenYes email).

Lecture Notes and Challenges

iPython Notebook on Python Input / Output (5.0 KB)

Benson Exploratory Data Analysis Notebook (11.7 KB)

The mta module used by the Benson notebook (9.4 KB)

Benson Challenges

Challenge 1

Challenge 2

For each key (the control area, unit, device address, and station of a specific turnstile), keep a list as before, but let each item in the list hold just the point in time and the cumulative count of entries.

This means keeping only the date, time, and entries fields in each list. You can convert the date and time into datetime objects; datetime is a Python class that represents a point in time. Combine the date and time fields into a single string and use the dateutil module to convert it into a datetime object. For an example, check this StackOverflow question.
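A minimal sketch of this conversion, under some assumptions. It uses the standard library's datetime.strptime rather than dateutil so that it runs without extra installs; dateutil.parser.parse takes the same combined string without a format argument. The row layout ([date, time, entries]), the MM-DD-YY date format, the sample counts, and the names raw_readings and to_time_series are all hypothetical, chosen only for illustration.

```python
from datetime import datetime

# Hypothetical input: one key per turnstile, one row per reading.
# Assumed row layout: [date, time, entries], with dates as MM-DD-YY strings.
raw_readings = {
    ("A002", "R051", "02-00-00", "LEXINGTON AVE"): [
        ["03-02-13", "03:00:00", "3788"],
        ["03-02-13", "07:00:00", "3790"],
    ],
}


def to_time_series(rows):
    """Turn [date, time, entries] rows into [datetime, int] pairs."""
    series = []
    for date_str, time_str, entries in rows:
        # Combine the two fields into one string, then parse it.
        combined = date_str + " " + time_str
        timestamp = datetime.strptime(combined, "%m-%d-%y %H:%M:%S")
        series.append([timestamp, int(entries)])
    return series


time_series = {key: to_time_series(rows) for key, rows in raw_readings.items()}
```

If your rows carry more fields or a different date format, adjust the unpacking and the format string accordingly.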

Your new dict should look something like

{    ('A002','R051','02-00-00','LEXINGTON AVE'):    
         [
            [datetime.datetime(2013, 3, 2, 3, 0), 3788],
            [datetime.datetime(2013, 3, 2, 7, 0), 2585],
            [datetime.datetime(2013, 3, 2, 12, 0), 10653],
            [datetime.datetime(2013, 3, 2, 17, 0), 11016],
            [datetime.datetime(2013, 3, 2, 23, 0), 10666],
            [datetime.datetime(2013, 3, 3, 3, 0), 10814],
            [datetime.datetime(2013, 3, 3, 7, 0), 10229],
            ...
          ],
 ....
 }
Challenge 3

Now keep the same keys, but store a single value per day: not the cumulative count, but the total number of passengers that entered through this turnstile on that day.
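One possible way to get from cumulative readings to daily totals is to add up the increases between consecutive readings. The sketch below assumes each reading is a [datetime, count] pair, credits each increase to the day on which the interval ends, and skips negative increases on the assumption that they are counter resets. The function name daily_entries and the sample numbers are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_entries(series):
    """Sum the increases between consecutive cumulative readings, per day."""
    totals = defaultdict(int)
    ordered = sorted(series)  # sort readings by timestamp
    for (prev_time, prev_count), (curr_time, curr_count) in zip(ordered, ordered[1:]):
        increase = curr_count - prev_count
        # Assumption: a negative increase means the counter was reset, so skip it.
        if increase >= 0:
            # Credit the increase to the day on which the interval ends.
            totals[curr_time.date()] += increase
    return sorted(totals.items())


# Hypothetical cumulative readings for one turnstile.
readings = [
    [datetime(2013, 3, 2, 3, 0), 100],
    [datetime(2013, 3, 2, 7, 0), 130],
    [datetime(2013, 3, 2, 23, 0), 400],
    [datetime(2013, 3, 3, 3, 0), 410],
    [datetime(2013, 3, 3, 7, 0), 450],
]
daily = daily_entries(readings)
```

Other conventions are equally defensible, for example crediting an interval to the day on which it starts; the choice only matters for readings that straddle midnight.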

Challenge 4

We will plot the daily time series for a turnstile.

In the IPython notebook, add this to the beginning of your next cell:

%matplotlib inline

This will make your matplotlib graphs integrate nicely with the notebook. To plot the time series, import matplotlib with

import matplotlib.pyplot as plt

Take the list of [(date1, count1), (date2, count2), ...] for the turnstile and turn it into two lists: dates and counts. This should plot it:

plt.figure(figsize=(10,3))
plt.plot(dates,counts)
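The split into dates and counts described above can be done with zip. This is a small sketch; daily_counts and its values are made up for illustration.

```python
from datetime import date

# Hypothetical daily totals for one turnstile.
daily_counts = [(date(2013, 3, 2), 300), (date(2013, 3, 3), 50)]

# zip(*pairs) transposes a list of pairs into two tuples.
dates, counts = zip(*daily_counts)
dates, counts = list(dates), list(counts)
```

The resulting dates and counts lists can be passed straight to plt.plot.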
Challenge 5

We want to combine the numbers: for each ControlArea/UNIT/STATION combo, for each day, add up the counts from every turnstile belonging to that combo.
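A sketch of one way to do this: drop the device address from each key and accumulate the counts per day. It assumes the per-turnstile daily totals are stored as lists of (date, count) pairs; the names turnstile_daily, combo_totals, and combo_daily, and the sample numbers, are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-turnstile daily totals,
# keyed by (control area, unit, device address, station).
turnstile_daily = {
    ("A002", "R051", "02-00-00", "LEXINGTON AVE"): [
        (date(2013, 3, 2), 300),
        (date(2013, 3, 3), 50),
    ],
    ("A002", "R051", "02-00-01", "LEXINGTON AVE"): [
        (date(2013, 3, 2), 200),
    ],
}

# combo -> day -> running total
combo_totals = defaultdict(lambda: defaultdict(int))
for (control_area, unit, _device, station), series in turnstile_daily.items():
    for day, count in series:
        combo_totals[(control_area, unit, station)][day] += count

# Convert the inner dicts back into sorted (date, count) lists.
combo_daily = {combo: sorted(by_day.items()) for combo, by_day in combo_totals.items()}
```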

Challenge 6

Similarly, combine everything in each station and come up with a time series of the form [(date1, count1), (date2, count2), ...] for each STATION by adding up all the turnstiles in the station.
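Since this is the same kind of regrouping with a coarser key, one option is a small reusable helper that takes a function mapping old keys to new ones. The helper name regroup, its new_key parameter, and the sample data (including the second station name) are assumptions for illustration; it also assumes the station is the last element of each key.

```python
from collections import defaultdict
from datetime import date


def regroup(series_by_key, new_key):
    """Add up (date, count) series whose keys map to the same new key."""
    totals = defaultdict(lambda: defaultdict(int))
    for key, series in series_by_key.items():
        for day, count in series:
            totals[new_key(key)][day] += count
    # Return one sorted [(date, count), ...] time series per new key.
    return {k: sorted(by_day.items()) for k, by_day in totals.items()}


# Hypothetical daily totals keyed by (control area, unit, station).
combo_daily = {
    ("A002", "R051", "LEXINGTON AVE"): [(date(2013, 3, 2), 500), (date(2013, 3, 3), 50)],
    ("A004", "R051", "LEXINGTON AVE"): [(date(2013, 3, 2), 100)],
    ("B010", "R412", "BOTANIC GARDEN"): [(date(2013, 3, 2), 40)],
}

# Assumption: the station name is the last element of each key.
station_daily = regroup(combo_daily, new_key=lambda key: key[-1])
```

The same helper works on the per-turnstile dict directly, because the station is the last element of those keys as well.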

Challenge 7

Plot the time series for a station

Challenge 8
Challenge 9
Challenge 10

plt.hist(total_ridership_counts)

to get an idea about the distribution of total ridership among different stations. This should show you that most stations have low traffic, and that the histogram bins for large traffic volumes have short bars.

Additional Hint:
If you want to see which stations account for most of the traffic, you can sort the total ridership counts and make a plt.bar graph. For this, you need two lists: the indices of each bar and the values. The indices can just be 0, 1, 2, 3, ..., so you can do

indices = range(len(total_ridership_values))
plt.bar(indices, total_ridership_values)
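The sorting step mentioned in the hint could look like the sketch below. It assumes the totals live in a dict mapping station name to total ridership; station_totals, its placeholder station names, and the numbers are hypothetical.

```python
# Hypothetical total ridership per station.
station_totals = {"STATION A": 1200, "STATION B": 50, "STATION C": 430}

# Sort stations from busiest to quietest.
ranked = sorted(station_totals.items(), key=lambda item: item[1], reverse=True)

station_names = [name for name, _ in ranked]
total_ridership_values = [total for _, total in ranked]

# One bar position per station: 0, 1, 2, ...
indices = list(range(len(total_ridership_values)))
```

Keeping station_names in the same order as the values makes it easy to label the bars or look up which station a tall bar belongs to.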