NASDAG.org

A data scientist blog, by Philippe Dagher

My First Week at Metis

During my first week as a data scientist, I used MTA subway data to help my client optimize the placement of food trucks in New York City.

The MTA subway data is available at http://web.mta.info/developers/turnstile.html . Data files are provided freely once per week by the city of New York. Every 4 hours, each turnstile logs its counters for cumulative entries and exits. These logs also report other information, such as the control area, the remote unit and the station of a given turnstile.
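
Since the ENTRIES and EXITS counters are cumulative, the ridership for a period is the difference between two consecutive readings. Here is a minimal sketch of parsing one data row; the field layout and sample values are taken from the notebook output further below, and the variable names are mine:

```python
# One line of a turnstile file; the fields are C/A, UNIT, SCP, STATION,
# LINENAME, DIVISION, DATE, TIME, DESC, ENTRIES, EXITS.
row = "A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,11/29/2014,03:00:00,REGULAR,0004894443,0001659781"
fields = row.split(",")

turnstile_key = tuple(fields[:4])   # identifies one physical turnstile
entries_counter = int(fields[9])    # cumulative count, not per-period ridership

# Four hours later the same turnstile reported 4894453 entries, so the
# ridership for that window is the difference of the two readings:
later_reading = 4894453
riders = later_reading - entries_counter   # 10 entries in that window
```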

Further below, you will find the IPython Notebook that illustrates the work done during my first data science challenge. You can also download the HTML version here.


Let’s first download a few MTA turnstile data files from http://web.mta.info/developers/turnstile.html

!wget http://web.mta.info/developers/data/nyct/turnstile/turnstile_141108.txt
--2015-01-19 08:40:43--  http://web.mta.info/developers/data/nyct/turnstile/turnstile_141108.txt
Resolving web.mta.info... 107.14.41.112, 107.14.41.104
Connecting to web.mta.info|107.14.41.112|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: 'turnstile_141108.txt.1'

turnstile_141108.tx     [  <=>                 ]  24.33M  1.57MB/s   in 16s    

2015-01-19 08:41:00 (1.49 MB/s) - 'turnstile_141108.txt.1' saved [25513691]
!wget http://web.mta.info/developers/data/nyct/turnstile/turnstile_141220.txt
--2015-01-19 08:41:00--  http://web.mta.info/developers/data/nyct/turnstile/turnstile_141220.txt
Resolving web.mta.info... 107.14.41.104, 107.14.41.112
Connecting to web.mta.info|107.14.41.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: 'turnstile_141220.txt.1'

turnstile_141220.tx     [   <=>                ]  24.04M  1.53MB/s   in 16s    

2015-01-19 08:41:16 (1.47 MB/s) - 'turnstile_141220.txt.1' saved [25208068]

Then we open a file, use a csv reader to read it, and build a Python dict with one key per (C/A, UNIT, SCP, STATION), i.e. the first four columns. The value for each key is a list of lists: each inner list holds the remaining columns of one row.

import csv
import sys
d = {}
with open('turnstile_141206.txt', 'r') as turnstile_file:
    reader = csv.reader(turnstile_file)
    reader.next()
    for num, row in enumerate(reader):
        try:
            if num == 5:
                break
            k = tuple(row[:4])
            v = row[4:10]
            v.append(row[10].replace(" ",""))
            if k not in d:
                d[k]=[]
            d[k].append(v)
        except TypeError:
            print >> sys.stderr, 'row %i failed' % num

print d
{('A002', 'R051', '02-00-00', 'LEXINGTON AVE'): [['NQR456', 'BMT', '11/29/2014', '03:00:00', 'REGULAR', '0004894443', '0001659781'], ['NQR456', 'BMT', '11/29/2014', '07:00:00', 'REGULAR', '0004894453', '0001659793'], ['NQR456', 'BMT', '11/29/2014', '11:00:00', 'REGULAR', '0004894507', '0001659885'], ['NQR456', 'BMT', '11/29/2014', '15:00:00', 'REGULAR', '0004894735', '0001659947'], ['NQR456', 'BMT', '11/29/2014', '19:00:00', 'REGULAR', '0004895083', '0001660008']]}

For each key (basically the control area, unit, device address and station of a specific turnstile), we will keep a list comprising just the point in time and the cumulative count of entries.

This means keeping only the date, time, and entries fields in each list. Date and time are converted into datetime objects, a Python class that represents a point in time.

from datetime import datetime, date, time
datetime.strptime('11/01/2014'+' '+'13:45:31', '%m/%d/%Y %H:%M:%S')
datetime.datetime(2014, 11, 1, 13, 45, 31)
def load_turnstile_file(ts_file, slot='all', entries=True):
    d = {}
    with open(ts_file, 'r') as turnstile_file:
        reader = csv.reader(turnstile_file)
        reader.next()
        for num, row in enumerate(reader):
            try:
                k = tuple(row[:4])
                if entries==True:
                    v = [datetime.strptime(row[6]+' '+row[7], '%m/%d/%Y %H:%M:%S'),int(row[9])]
                else:
                    v = [datetime.strptime(row[6]+' '+row[7], '%m/%d/%Y %H:%M:%S'),int(row[10].replace(" ",""))]
                if k not in d:
                    d[k]=[]
                if (slot == 'all') or (slot == 'morning' and v[0].hour < 12) or (slot == 'noon' and (v[0].hour > 10 and v[0].hour<16)) or (slot == 'evening' and v[0].hour > 14):
                    d[k].append(v)
            except (TypeError, IndexError):
                print >> sys.stderr, 'row %i failed' % num
    return d


d = load_turnstile_file('turnstile_141220.txt')
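
Note that the three time slots in load_turnstile_file overlap: morning covers hours 0 to 11, noon covers 11 to 15, and evening covers 15 to 23, so readings taken at 11:00 or 15:00 are counted in two slots. A small helper (hypothetical, not part of the original notebook) makes the boundaries explicit:

```python
def slots_for(hour):
    """Return the slot labels that a reading taken at this hour falls into,
    using the same boundaries as load_turnstile_file above."""
    slots = []
    if hour < 12:
        slots.append('morning')
    if 10 < hour < 16:
        slots.append('noon')
    if hour > 14:
        slots.append('evening')
    return slots
```

For example, slots_for(11) returns ['morning', 'noon'] and slots_for(15) returns ['noon', 'evening'], which is worth keeping in mind when comparing slot totals against the all-day totals.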

These counts are cumulative, taken every few hours, but we want total daily entries. So let’s build a structure that again has the same keys, but now with a single value per day: the total number of passengers who entered through that turnstile on that day.
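
Both solutions below rest on the same observation: because the counters only ever increase during a day (barring resets), the day's ridership is simply the maximum reading minus the minimum reading. With the cumulative entries from the sample output above:

```python
# Cumulative ENTRIES readings logged for one turnstile on 11/29/2014:
counts = [4894443, 4894453, 4894507, 4894735, 4895083]

# Daily ridership = highest reading of the day minus the lowest reading:
daily_ridership = max(counts) - min(counts)   # 640 entries that day
```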

#solution A
#1    { (turnstyle):[ [datetime1, counter1], [datetime2, counter2], [datetime3, counter3], ... ] }                d
#2                  [ [day1, counter1], [day1, counter2], [day2, counter3], ... ]                                 .date()
#3                  { day1:[counter1, counter2, ...], day2:[counter3, ...], ... }                                 day_to_cumulcounts
#4                  [ [day1, maxcountersday1 - mincountersday1], [day2, maxcountersday2 - mincountersday2], ...]  daily_count_dict

#solution B
#1    { (turnstyle):[ [datetime1, counter1], [datetime2, counter2], [datetime3, counter3], ... ] }
#2                  [ [day1, counter1, 0, 9999999999], [day1, counter2, 0, 9999999999], [day2, counter3, 0, 9999999999], ... ]
#3                  [ [day1, counter1, mincountersday1, maxcountersday1], [day1, counter2, mincountersday1, maxcountersday1], [day2, counter3, mincountersday2, maxcountersday2], ... ]
#4                  [ [day1, maxcountersday1 - mincountersday1], [day2, maxcountersday2 - mincountersday2], ...]

def solutionA(d):
    daily_count_dict = {}
    for i,j in d.items():
        day_to_cumulcounts = {}
        for v in j:
            day = v[0].date()
            if day not in day_to_cumulcounts:
                day_to_cumulcounts[day] = []
            day_to_cumulcounts[day].append(v[1])
        for day, counts in day_to_cumulcounts.items():
            daily_ridership = max(counts) - min(counts)
            if i not in daily_count_dict:
                daily_count_dict[i]=[]
            daily_count_dict[i].append( (day, daily_ridership))
    return daily_count_dict
# solutionB

def extend_d(d):
    dd = {}
    for i,j in d.items():
        for v in j:
            day = v[0].date()
            c = v[1]
            k = i
            v = [day, c, 99999999999, 0]
            if k not in dd:
                dd[k]=[]
            dd[k].append(v)
    return dd

def fill_min_max(dd):
    ddd = {}
    for i,j in dd.items():
        for v in j:
            d,c,mn,mx = v
            for vv in j:
                if vv[0]==d:
                    if mn>vv[1]:
                        mn=vv[1]
                    if mx<vv[1]:
                        mx=vv[1]
            k = i
            v = [d, c, mn, mx]
            if k not in ddd:
                ddd[k]=[]
            ddd[k].append(v)
    return ddd

def keep_diff(ddd):
    dddd = {}
    for i,j in ddd.items():
        for v in j:
            d = v[0]
            c = v[3]-v[2]
            if c > 100000: #if the counter is reset then remove inappropriate result
                c = 0
            k = i
            v = [d, c]
            if k not in dddd:
                dddd[k]=[]
            if v not in dddd[k]:
                dddd[k].append(v)
    return dddd
dddd = keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141108.txt'))))

ddddm = keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141108.txt', 'morning'))))

ddddn = keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141108.txt', 'noon'))))

dddde = keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141108.txt', 'evening'))))

dddd2 = keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141018.txt'))))
ddddA = solutionA(load_turnstile_file('turnstile_141108.txt'))

print dddd.items()[0]
print ddddA.items()[0]
(('A030', 'R083', '01-06-00', '23 ST-5 AVE'), [[datetime.date(2014, 11, 1), 1003], [datetime.date(2014, 11, 2), 979], [datetime.date(2014, 11, 3), 2387], [datetime.date(2014, 11, 4), 2472], [datetime.date(2014, 11, 5), 2585], [datetime.date(2014, 11, 6), 2641], [datetime.date(2014, 11, 7), 2538]])
(('A030', 'R083', '01-06-00', '23 ST-5 AVE'), [(datetime.date(2014, 11, 1), 1003), (datetime.date(2014, 11, 6), 2641), (datetime.date(2014, 11, 7), 2538), (datetime.date(2014, 11, 4), 2472), (datetime.date(2014, 11, 5), 2585), (datetime.date(2014, 11, 2), 979), (datetime.date(2014, 11, 3), 2387)])
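
The two outputs above hold the same (day, count) pairs, but solutionA's appear in arbitrary order because it iterates over a Python 2 dict, whose key order is unspecified. Sorting each station's list puts them in chronological order, since tuples compare element-wise starting from the first item. A toy sketch with made-up pairs:

```python
from datetime import date

# (day, count) pairs in arbitrary dict-iteration order:
pairs = [(date(2014, 11, 6), 2641),
         (date(2014, 11, 1), 1003),
         (date(2014, 11, 3), 2387)]

# Tuples sort element-wise, so this orders the pairs chronologically:
pairs.sort()   # pairs[0] is now (date(2014, 11, 1), 1003)
```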

We then plot the daily time series for a turnstile. We take the turnstile’s list of [(date1, count1), (date2, count2), …] and split it into two lists: dates and counts.

%matplotlib inline
import matplotlib.pyplot as plt

a,b = dddd.items()[0]

am,bm = ddddm.items()[0]

an,bn = ddddn.items()[0]

ae,be = dddde.items()[0]

a2,b2 = dddd2.items()[0]

dates=[]
counts=[]
for x,y in b:
    dates.append(x)
    counts.append(y)

datesm=[]
countsm=[]
for x,y in bm:
    datesm.append(x)
    countsm.append(y)

datesn=[]
countsn=[]
for x,y in bn:
    datesn.append(x)
    countsn.append(y)

datese=[]
countse=[]
for x,y in be:
    datese.append(x)
    countse.append(y)

dates2=[]
counts2=[]
for x,y in b2:
    dates2.append(x)
    counts2.append(y)

plt.figure(figsize=(10,3))
plt.plot(dates,counts,datesm,countsm,datesn,countsn,datese,countse)
[<matplotlib.lines.Line2D at 0x111e79d10>,
 <matplotlib.lines.Line2D at 0x1171b8350>,
 <matplotlib.lines.Line2D at 0x1171b8a10>,
 <matplotlib.lines.Line2D at 0x1189d4090>]

(plot: daily entry counts for one turnstile, for the all-day, morning, noon and evening slots)

So far we’ve been operating at the single-turnstile level; let’s combine turnstiles in the same ControlArea/Unit/Station combo. Some ControlArea/Unit/Station groups have a single turnstile, but most have multiple turnstiles: the same value for the C/A, UNIT and STATION columns, with different values for the SCP column. We want to combine the numbers: for each ControlArea/UNIT/STATION combo, for each day, let’s add up the counts from every turnstile belonging to that combo.

def remove_scp(dddd):
    n = {}
    for a,b in dddd.items():
        k = (a[0],a[1],a[3])
        if k not in n:
            n[k]=[]
        li = n[k]
        li.extend(b)
        n[k]=li
    return n

n = remove_scp(dddd)

def aggregate_count(n):
    nnn = {}
    for i,j in n.items():
        l = {}
        for v in j:
            d = v[0]
            c = v[1]
            if d not in l:
                l[d] = 0
            l[d] = l[d] + c
        r = []
        for k, v in l.items():
            r.append([k,v])
        nnn[i]=sorted(r)
    return nnn

nnn = aggregate_count(remove_scp(dddd))

Similarly, let’s combine everything in each station and come up with a time series for each STATION, by adding up all the turnstiles in a station.

def keep_station(dddd):
    n = {}
    for a,b in dddd.items():
        k = (a[3])
        if k not in n:
            n[k]=[]
        li = n[k]
        li.extend(b)
        n[k]=li
    return n

n = remove_scp(dddd)
n2 = remove_scp(dddd2)

nnn = aggregate_count(keep_station(dddd))
nnnm = aggregate_count(keep_station(ddddm))
nnnn = aggregate_count(keep_station(ddddn))
nnne = aggregate_count(keep_station(dddde))
nnn2 = aggregate_count(keep_station(dddd2))
print nnn.items()[0]
print nnn.items()[1]
print nnn2.items()[0]
print nnn2.items()[1]

def station_count(nnn, station_name='CROWN HTS-UTICA'):
    for x, y in nnn.items():
        if x == station_name:
            return y

import operator
def station_weekcount(nnn):
    d = {}
    for x, y in nnn.items():
        c = 0
        for t in y:
            c = c + t[1]
        d[x]=c
    return d

sc = station_count(nnn)
scm = station_count(nnnm)
scn = station_count(nnnn)
sce = station_count(nnne)
print sce
('BOYD-88 ST', [[datetime.date(2014, 11, 1), 291], [datetime.date(2014, 11, 2), 228], [datetime.date(2014, 11, 3), 579], [datetime.date(2014, 11, 4), 533], [datetime.date(2014, 11, 5), 552], [datetime.date(2014, 11, 6), 594], [datetime.date(2014, 11, 7), 576]])
('NEWKIRK PLAZA', [[datetime.date(2014, 11, 1), 5219], [datetime.date(2014, 11, 2), 4585], [datetime.date(2014, 11, 3), 11009], [datetime.date(2014, 11, 4), 10076], [datetime.date(2014, 11, 5), 11365], [datetime.date(2014, 11, 6), 10825], [datetime.date(2014, 11, 7), 10956]])
('BOYD-88 ST', [[datetime.date(2014, 10, 11), 213], [datetime.date(2014, 10, 12), 132], [datetime.date(2014, 10, 13), 607], [datetime.date(2014, 10, 14), 925], [datetime.date(2014, 10, 15), 738], [datetime.date(2014, 10, 16), 702], [datetime.date(2014, 10, 17), 499]])
('NEWKIRK PLAZA', [[datetime.date(2014, 10, 11), 5589], [datetime.date(2014, 10, 12), 4875], [datetime.date(2014, 10, 13), 8101], [datetime.date(2014, 10, 14), 11181], [datetime.date(2014, 10, 15), 11394], [datetime.date(2014, 10, 16), 10809], [datetime.date(2014, 10, 17), 7269]])
[[datetime.date(2014, 11, 1), 3101], [datetime.date(2014, 11, 2), 4601], [datetime.date(2014, 11, 3), 7069], [datetime.date(2014, 11, 4), 7024], [datetime.date(2014, 11, 5), 7227], [datetime.date(2014, 11, 6), 6880], [datetime.date(2014, 11, 7), 7865]]
dates=[1,2,3,4,5,6,7]

def station_count2list(sc):
    counts=[]
    for x,y in sc:
        counts.append(y)
    return counts

counts=station_count2list(sc)

countsm=station_count2list(scm)

countsn=station_count2list(scn)

countse=station_count2list(sce)

Let’s plot the time series for a station.

plt.figure(figsize=(10,3))
plt.plot(dates,countsm, color='blue')
plt.plot(dates,countsn, color='green')
plt.plot(dates,countse, color='red')
[<matplotlib.lines.Line2D at 0x10e590d90>]

(plot: daily counts for one station, morning in blue, noon in green, evening in red)

Let’s make one list of counts for one week for one station: Monday’s count, Tuesday’s count, and so on, i.e. a list of 7 counts. Then let’s build the same list for several other weeks and draw a rainbow plot of the weekly commute numbers on top of each other.

dates=[1,2,3,4,5,6,7]

counts1=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141018.txt'))))))))

counts2=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_150110.txt'))))))))

counts3=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141108.txt'))))))))

counts4=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141220.txt'))))))))

plt.figure(figsize=(10,3))
plt.plot(dates,counts1)
plt.plot(dates,counts2)
plt.plot(dates,counts3)
plt.plot(dates,counts4)
[<matplotlib.lines.Line2D at 0x1185f2fd0>]

(plot: weekly daily-count curves for four different weeks at one station)

dates=[1,2,3,4,5,6,7]

counts1=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141101.txt'))))))))

counts2=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141108.txt'))))))))

counts3=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141115.txt'))))))))

counts4=station_count2list(station_count(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141122.txt'))))))))

plt.figure(figsize=(10,3))
plt.plot(dates,counts1)
plt.plot(dates,counts2)
plt.plot(dates,counts3)
plt.plot(dates,counts4)
[<matplotlib.lines.Line2D at 0x115f301d0>]

(plot: weekly daily-count curves for four consecutive weeks at one station)

d1= station_weekcount(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141101.txt')))))))
d2= station_weekcount(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141108.txt')))))))
d3= station_weekcount(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141115.txt')))))))
d4= station_weekcount(aggregate_count(keep_station(keep_diff(fill_min_max(extend_d(load_turnstile_file('turnstile_141122.txt')))))))

Over multiple weeks, let’s sum total ridership for each station and sort the results, so we can find the stations with the highest traffic during the period we are investigating.

sorted_d1 = sorted(d1.items(), key=operator.itemgetter(1))
print sorted_d1[350:]
[('PATH WTC', 230104), ('W 4 ST-WASH SQ', 235950), ('145 ST', 236173), ('LEXINGTON-53 ST', 243107), ('GROVE STREET', 246475), ('77 ST', 250046), ('JAY ST-METROTEC', 253341), ('14 ST', 255697), ('JAMAICA CENTER', 256465), ('BARCLAYS CENTER', 271981), ('50 ST', 277749), ('WALL ST', 284185), ('ROOSEVELT AVE', 294613), ('72 ST', 304716), ('47-50 ST-ROCK', 316219), ('MAIN ST', 359698), ('59 ST', 366449), ('23 ST', 402759), ('CANAL ST', 411394), ('CHAMBERS ST', 414954), ('FULTON ST', 424654), ('59 ST-COLUMBUS', 424784), ('96 ST', 464841), ('125 ST', 507787), ('42 ST-TIMES SQ', 534275), ('42 ST-PA BUS TE', 572339), ('14 ST-UNION SQ', 646295), ('86 ST', 655437), ('34 ST-HERALD SQ', 678610), ('42 ST-GRD CNTRL', 833260), ('34 ST-PENN STA', 1026642)]
d={}
for x,y in d1.items():
    d[x]=y+d2[x]+d3[x]+d4[x]

sorted_d = sorted(d.items(), key=operator.itemgetter(1))

print sorted_d[350:]
[('42 ST-BRYANT PK', 923064), ('CHURCH AVE', 926261), ('BOROUGH HALL/CT', 942163), ('145 ST', 961730), ('77 ST', 982905), ('JAMAICA CENTER', 1035962), ('LEXINGTON-53 ST', 1046754), ('14 ST', 1051173), ('JAY ST-METROTEC', 1063847), ('WALL ST', 1131397), ('BARCLAYS CENTER', 1132182), ('72 ST', 1196479), ('50 ST', 1199573), ('ROOSEVELT AVE', 1215277), ('47-50 ST-ROCK', 1380020), ('MAIN ST', 1474695), ('59 ST', 1572208), ('CANAL ST', 1588636), ('23 ST', 1616558), ('FULTON ST', 1670921), ('CHAMBERS ST', 1720315), ('59 ST-COLUMBUS', 1866056), ('96 ST', 1903673), ('125 ST', 2000267), ('42 ST-TIMES SQ', 2309378), ('42 ST-PA BUS TE', 2363820), ('14 ST-UNION SQ', 2403178), ('86 ST', 2742582), ('34 ST-HERALD SQ', 2962153), ('42 ST-GRD CNTRL', 3466958), ('34 ST-PENN STA', 4224665)]

Let’s make a single list of these total ridership values and plot it to get an idea of how total ridership is distributed across stations. This shows that most stations have low traffic, and the histogram bins for large traffic volumes have small bars.

To see which stations carry the bulk of the traffic, we will sort the total ridership counts and make a plt.bar graph. For this we use two lists: the indices of the bars, and the values.

total_ridership_counts = [x for y,x in d.items()]
plt.figure(figsize=(10,3))
plt.hist(total_ridership_counts[:-300])
(array([ 26.,  19.,  12.,   9.,   5.,   2.,   3.,   1.,   2.,   2.]),
 array([   12987. ,   124906.5,   236826. ,   348745.5,   460665. ,
          572584.5,   684504. ,   796423.5,   908343. ,  1020262.5,
         1132182. ]),
 <a list of 10 Patch objects>)

(plot: histogram of total ridership per station)

total_ridership_values = sorted(total_ridership_counts, reverse=True)
indices = range(len(total_ridership_values))
plt.bar(indices, total_ridership_values)
<Container object of 381 artists>

(plot: bar chart of total ridership per station, sorted in decreasing order)