NASDAG.org

A data scientist blog, by Philippe Dagher

Luther

We will use BeautifulSoup to scrape data from http://boxofficemojo.com/ and http://allocine.fr/.

import urllib2
from bs4 import BeautifulSoup

Let’s first define some basic functions to convert the scraped data into a usable format.

import dateutil.parser
import urllib

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date

def money_to_int(moneystring):
    try:
        moneystring = moneystring.replace('$','').replace(',','')
        return int(moneystring)
    except:
        return None

def clear_url(raw_url):
    tmp_url = raw_url.replace('(','').replace(')','').replace('-',' ').replace('.','')
    tmp_url = tmp_url.replace(u'é','e').replace(u'â','a').replace(u'è','e').encode('latin1','ignore')
    return tmp_url
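As a quick sanity check, the money helper behaves like this (a self-contained restatement with illustrative inputs):

```python
def money_to_int(moneystring):
    # Strip the dollar sign and thousands separators, then parse.
    try:
        return int(moneystring.replace('$', '').replace(',', ''))
    except (ValueError, AttributeError):
        return None  # 'n/a' and similar placeholders yield None

# money_to_int('$73,026,337') -> 73026337
# money_to_int('n/a') -> None
```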

We load from the international section of Mojo the tables of top-ranking movies by revenue in France, for the years 2010 to 2014. For each year we scan up to four pages, with 100 movies per page. Each row contains the movie’s rank for that year, the movie name (without the Mojo URL), the French distributor, the French gross, and the French release date (to which we append the year).

data = []
url = "http://boxofficemojo.com/intl/france/yearly/?yr="

for scan in ['2014','2013','2012','2011','2010']:
    for scan_page in [str(x+1) for x in range(4)]:
        page = urllib2.urlopen(url+scan+'&pagenum='+scan_page)
        soup = BeautifulSoup(page)
        try:
            table = soup.find_all('table')
            data_table = table[4]
            rows = data_table.find_all('tr')
            for row in rows[1:]:
                cols = row.find_all('td')
                cols = [ele.text.strip() for ele in cols]
                cols[4]=cols[4]+'/'+scan
                data.append(cols)
        except IndexError:
            pass
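For readers without BeautifulSoup at hand, the per-row extraction above boils down to walking `<tr>`/`<td>` pairs; a stdlib-only sketch of the same idea (the markup below is hypothetical):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_td = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row:
            self._row[-1] += data.strip()

parser = TableRows()
parser.feed('<tr><td>1</td><td>Lucy</td></tr>'
            '<tr><td>2</td><td>Gravity</td></tr>')
# parser.rows -> [['1', 'Lucy'], ['2', 'Gravity']]
```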

To get more info from Mojo for each movie, we have to visit its dedicated Mojo page, which we find through the Mojo search engine. From the search results, we pick the movie whose US release date falls within 180 days of the French release date. The search results give information such as the studio, US box office, US number of theaters, and the revenue and theater count of the US opening weekend; once on the movie page, we can get information such as the movie’s director.

data_with_url_and_release_dates = [['FrenchRank',
                                    'MovieName',
                                    'MojoUrl',
                                    'FrenchDistributor',
                                    'FrenchGross',
                                    'FrenchReleaseDate',
                                    'Studio',
                                    'USGross',
                                    'USNumberOfTheaters',
                                    'USOpening',
                                    'USOpeningTheaters',
                                    'USReleaseDate',
                                    'Director']]
for row in data:
    url = "http://boxofficemojo.com/search/?q="
    query = urllib.quote(clear_url(row[1]))
    try:
        page = urllib2.urlopen(url+query)
        soup = BeautifulSoup(page)
        content = []
        release_date_given = to_date(row[4])

        try:
            table = soup.find_all('table')
            link_table = table[4]
            rows = link_table.find_all('tr')
            for x in rows[1:]:
                cols = x.find_all('td')
                col_with_link = cols[0]
                col_with_release_date = cols[6]
                try:
                    release_date_searched = to_date(col_with_release_date.text)
                except ValueError:
                    continue  # skip rows whose release date cannot be parsed
                delta = release_date_searched - release_date_given
                if abs(delta.days) < 180:
                    matching_url = 'http://boxofficemojo.com'+col_with_link.find('a')['href']
                    director = url_to_director(matching_url)
                    row_data = [int(row[0]), # french rank
                                row[1], # name of the movie
                                matching_url, # url
                                row[2], # distributor
                                money_to_int(row[3]), # french gross
                                release_date_given, # french release date
                                cols[1].text, # studio
                                money_to_int(cols[2].text), # us gross
                                money_to_int(cols[3].text), # us number of theaters
                                money_to_int(cols[4].text), # us opening
                                money_to_int(cols[5].text), # number of us opening theaters
                                release_date_searched, # us release date
                                director] # us director
                    data_with_url_and_release_dates.append(row_data)
                    break
        except:
            pass
    except:
        pass
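The matching rule itself is tiny; isolated, it reads as follows (the dates are illustrative):

```python
from datetime import datetime

def releases_match(fr_date, us_date, window=180):
    """Accept a search hit when the US and French release dates are
    fewer than `window` days apart."""
    return abs((fr_date - us_date).days) < window

# A ~68-day gap matches; a gap of well over a year does not.
# releases_match(datetime(2011, 2, 9), datetime(2010, 12, 3)) -> True
# releases_match(datetime(2012, 6, 1), datetime(2010, 12, 3)) -> False
```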

Save the data to a CSV file.

import csv

with open('nasdag_luther_data.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for row in data_with_url_and_release_dates:
        row = [c.encode('utf8') if isinstance(c, unicode) else c for c in row]
        writer.writerow(row)
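The per-cell encode step is a Python 2 requirement; under Python 3 the csv module writes text directly, so unicode needs no manual handling (a minimal sketch using an in-memory buffer instead of a file):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['Amélie', 2001])  # no .encode('utf8') needed in Python 3
# buf.getvalue() -> 'Amélie,2001\r\n'
```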

Before scraping data from Allociné, we need to define some functions.

The first one uses Allociné’s search engine to find the movie page matching a movie name, given a release date and a director.

def allocine(movie_name, release_date_given, director):
    url = "http://www.allocine.fr/recherche/?q="
    query = clear_url(movie_name.replace(' ','+'))
    page = urllib2.urlopen(url+query)
    soup = BeautifulSoup(page)

    try:
        table = soup.find_all('table')
        link_table = table[0]
        rows = link_table.find_all('tr')
        for x in rows:
            try:
                cols = x.find_all('td')
                col_with_link = cols[1]
                search_text = col_with_link.find(class_='fs11').text
                release_date_year = search_text.split()[0]
                if release_date_year == release_date_given and director in search_text:
                    fr_movie_url = 'http://www.allocine.fr'+col_with_link.find('a')['href']
                    return fr_movie_url
            except:
                pass
    except:
        pass
    return None
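The filtering condition inside the loop can be isolated as follows (the argument values here are hypothetical):

```python
def allocine_hit(search_text, release_year, director):
    """A search result matches when its blurb starts with the release
    year and mentions the director."""
    return search_text.split()[0] == release_year and director in search_text

# allocine_hit('2010 | Comedy | directed by Jay Roach', '2010', 'Jay Roach') -> True
# allocine_hit('2008 | Comedy | directed by Jay Roach', '2010', 'Jay Roach') -> False
```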

This function will locate the rating and the number of entries on a given Allociné movie page.

def french_press_rating_with_entries(fr_movie_link):
    page = urllib2.urlopen(fr_movie_link)
    soup = BeautifulSoup(page)
    try:
        note = soup.find(class_='note').text.replace(',','.')
        entries = "".join(soup.find_all(class_="visible")[1].find_all('td')[2].text.split()[:-1])
        return float(note), int(entries)
    except:
        pass
    return None, None
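The two one-liners above deserve unpacking: Allociné prints ratings with a decimal comma and entry counts with spaced thousands followed by a unit word (the sample strings below are hypothetical):

```python
note_text = '3,8'                   # rating with a French decimal comma
entries_text = '1 234 567 entrees'  # count with spaced thousands + unit word

note = float(note_text.replace(',', '.'))          # -> 3.8
entries = int(''.join(entries_text.split()[:-1]))  # drop the unit word, join digits -> 1234567
```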

This function locates the director’s name on a given Mojo movie page (it is called during the Mojo scraping above).

def url_to_director(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)

    cells = soup.find_all('td')

    previous = cells[0]
    for cell in cells[1:]:
        if previous.find(text='Director:') or previous.find(text='Directors:') or previous.find(text='Producers:'):
            return cell.find_all('a')[0].text
        previous = cell
    return ''
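The previous/current two-cell walk generalizes to any label–value table; a compact restatement (the cell texts are hypothetical):

```python
def value_after_label(cell_texts, labels):
    """Scan a flat list of table-cell texts and return the cell that
    immediately follows a label cell such as 'Director:'."""
    previous = None
    for cell in cell_texts:
        if previous in labels:
            return cell
        previous = cell
    return ''

cells = ['Genre:', 'Comedy', 'Director:', 'Jay Roach', 'Writer:', 'David Guion']
# value_after_label(cells, {'Director:', 'Directors:'}) -> 'Jay Roach'
```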

Let’s now load the data from the previous scraping phase.

import csv

cr = csv.reader(open('nasdag_luther_data.csv','rb'))
data_with_url_and_release_dates = []
for row in cr:
    data_with_url_and_release_dates.append(row)

We can now run the scraping script against Allociné and append the French rating and number of entries to each movie.

from dateutil import parser

data_with_allocine = []

row0 = data_with_url_and_release_dates[0]
row0.append('FrRating')
row0.append('FrEntries')

data_with_allocine.append(row0)

for row in data_with_url_and_release_dates[1:]:
    try:
        french_url = allocine(row[1], str(parser.parse(row[5]).year), row[12])
        fr_rating, fr_entries = french_press_rating_with_entries(french_url)
        row_data = row
        row_data.append(fr_rating)
        row_data.append(fr_entries)
        data_with_allocine.append(row_data)
    except:
        pass

And finally, save the complete data to a CSV file.

with open('nasdag_luther_data_with_french_rating.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for row in data_with_allocine:
        row = [c.encode('utf8') if isinstance(c, unicode) else c for c in row]
        writer.writerow(row)

We will now analyze the data scraped from Mojo and Allociné, checking for a linear relationship between the French number of entries and the other available variables.

So let’s load the data from the CSV file.

%matplotlib inline
import pandas as pd
import numpy as np
from datetime import datetime

movies = pd.read_csv('nasdag_luther_data_with_french_rating.csv')
movies['FrenchReleaseDate'] = pd.to_datetime(movies['FrenchReleaseDate'])
movies['USReleaseDate'] = pd.to_datetime(movies['USReleaseDate'])

print 'Number of loaded movies:', len(movies.index)
Number of loaded movies: 702
print movies.tail(2)
     FrenchRank               MovieName  \
700         335  The Thorn in the Heart   
701         338     Dinner for Schmucks   

                                               MojoUrl FrenchDistributor  \
700  http://boxofficemojo.com/movies/?id=thorninthe...               n/a   
701  http://boxofficemojo.com/movies/?id=dinnerfors...               PPI   

     FrenchGross FrenchReleaseDate Studio   USGross  USNumberOfTheaters  \
700         7172        2010-04-21  Osci.      7376                   1   
701         6132        2010-11-10   P/DW  73026337                3046   

     USOpening  USOpeningTheaters USReleaseDate       Director  FrRating  \
700       5173                  1    2010-04-02  Michel Gondry       NaN   
701   23527839               2911    2010-07-30      Jay Roach       NaN   

     FrEntries  
700        NaN  
701        NaN  
import matplotlib.pyplot as plt
(movies['FrEntries']/1e6).hist(bins=7)
plt.title('Histogram of movies by Entries in France')
plt.xlabel('Number of entries (in millions)')
plt.ylabel('Number of movies')

movies['FrRating'].hist(bins=10, range=(0.5,5.5))
plt.title('Histogram of movies by French Rating')
plt.xlabel('Rating')
plt.ylabel('Number of movies')

movies_with_rating = movies.dropna()
plt.scatter(movies_with_rating.FrRating , movies_with_rating.FrEntries/1e6)

plt.title('How does rating relate to Entries in France?')
plt.xlabel('Rating')
plt.ylabel('Entries in France (in millions)')

from matplotlib.ticker import MaxNLocator
plt.gca().yaxis.set_major_locator(MaxNLocator(prune='lower'))


Let’s now restrict the data to movies with rating information from a selected list of US studios.

print 'Number of preselected movies (with data):', len(movies_with_rating.index)
movies_with_rating_from_studio = movies_with_rating[ movies_with_rating.Studio.isin(['Fox',
                                                                                     'FoxS',
                                                                                     'WB',
                                                                                     'WB (NL)',
                                                                                     'Disney',
                                                                                     'Warner Bros.',
                                                                                     'Metropolitan',
                                                                                     'UPI',
                                                                                     'CBS',
                                                                                     'Sony',
                                                                                     'TriS',
                                                                                     'RTWC',
                                                                                     'Wild Bunch'])]
length = len(movies_with_rating_from_studio.index)
print 'Number of movies from selected studios:', length

plt.scatter(movies_with_rating_from_studio.FrRating , movies_with_rating_from_studio.FrEntries/1e6 )
plt.title('How does rating relate to Entries in France?')
plt.xlabel('Rating')
plt.ylabel('Entries in France (in millions)')
plt.gca().yaxis.set_major_locator(MaxNLocator(prune='lower'))
Number of preselected movies (with data): 351
Number of movies from selected studios: 143


Is there any direct relation between rating and entries?

movies_with_rating_from_studio['Ones'] = 1.0
import statsmodels.formula.api as sm

X = movies_with_rating_from_studio[['FrRating', 'Ones']]
Y = movies_with_rating_from_studio['FrEntries']

linmodel = sm.OLS(Y, X).fit()
print linmodel.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              FrEntries   R-squared:                       0.171
Model:                            OLS   Adj. R-squared:                  0.165
Method:                 Least Squares   F-statistic:                     29.06
Date:                Sun, 01 Feb 2015   Prob (F-statistic):           2.88e-07
Time:                        21:20:45   Log-Likelihood:                -2201.5
No. Observations:                 143   AIC:                             4407.
Df Residuals:                     141   BIC:                             4413.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
FrRating    7.022e+05    1.3e+05      5.391      0.000      4.45e+05   9.6e+05
Ones       -8.044e+05   3.78e+05     -2.130      0.035     -1.55e+06 -5.79e+04
==============================================================================
Omnibus:                       58.202   Durbin-Watson:                   0.966
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              150.153
Skew:                           1.678   Prob(JB):                     2.48e-33
Kurtosis:                       6.734   Cond. No.                         12.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
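With a single regressor plus a constant, the fitted coefficients have a closed form; a stdlib sketch on toy data (not the movie data) shows what OLS computes under the hood:

```python
def simple_ols(x, y):
    """Slope and intercept of y = b1*x + b0 by ordinary least squares
    (closed form, one regressor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b1, b0

# simple_ols([1, 2, 3], [1, 3, 5]) -> (2.0, -1.0)
```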
plt.scatter(movies_with_rating_from_studio.FrRating, movies_with_rating_from_studio.FrEntries/1e6)

plt.title('How does rating relate to Entries in France?')
plt.xlabel('Rating')
plt.ylabel('Entries in France (in millions)')

plt.plot(movies_with_rating_from_studio.FrRating, linmodel.predict(X)/1e6, 'r-')

plt.gca().yaxis.set_major_locator(MaxNLocator(prune='lower'))


Double-check with sklearn…

from sklearn.linear_model import LinearRegression

X = movies_with_rating_from_studio[['FrRating']]
Y = movies_with_rating_from_studio['FrEntries']

sk_linmodel = LinearRegression()
sk_linmodel.fit(X,Y)

print 'coefficient:\t %g' % sk_linmodel.coef_
print 'intercept:  \t %g' % sk_linmodel.intercept_
print 'R^2:        \t %g' % sk_linmodel.score(X,Y)
coefficient:     702188
intercept:       -804445
R^2:             0.170878
plt.plot(Y/1e6, sk_linmodel.predict(X)/1e6, 'go')
plt.xlabel('Actual French Entries (in millions)')
plt.ylabel('Predicted French Entries By Model')

plt.plot(Y/1e6, Y/1e6, 'm--', label='Ideal Prediction')
plt.gca().yaxis.set_major_locator(MaxNLocator(prune='lower'))
plt.legend(loc='upper left')

The basic model is not very satisfactory, so we will use other information to get a better result…

movies_with_rating_from_studio['ReleaseDelta'] = \
            [(((d1-d2).total_seconds())/86400.0) for d1, d2 in \
             zip(movies_with_rating_from_studio.FrenchReleaseDate, movies_with_rating_from_studio.USReleaseDate)]

movies_with_rating_from_studio['FrRating2'] = [x*x for x in movies_with_rating_from_studio.FrRating]
movies_with_rating_from_studio['ReleaseDelta2'] = [x*x for x in movies_with_rating_from_studio.ReleaseDelta]
movies_with_rating_from_studio['USOpening2'] = [x*x for x in movies_with_rating_from_studio.USOpening]

from math import sqrt
movies_with_rating_from_studio['FrRatingSqrt'] = [sqrt(x) for x in movies_with_rating_from_studio.FrRating]
from math import log
movies_with_rating_from_studio['FrRatingLog'] = [log(x) for x in movies_with_rating_from_studio.FrRating]
from math import log10
movies_with_rating_from_studio['Log10FrEntries'] = [log10(x) for x in movies_with_rating_from_studio.FrEntries]
movies_with_rating_from_studio['Log10USOpening'] = [log10(x) for x in movies_with_rating_from_studio.USOpening]
movies_with_rating_from_studio['Log10ReleaseDelta'] = [log10(x+180) for x in movies_with_rating_from_studio.ReleaseDelta]
movies_with_rating_from_studio['InvReleaseDelta'] = [1/(x+1) for x in movies_with_rating_from_studio.ReleaseDelta]
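For one concrete pair of dates, the ReleaseDelta computation gives (using Dinner for Schmucks’ dates from the table above):

```python
from datetime import datetime

fr_release = datetime(2010, 11, 10)
us_release = datetime(2010, 7, 30)

release_delta = (fr_release - us_release).total_seconds() / 86400.0
# release_delta -> 103.0 (France opened 103 days after the US)
```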
fig = plt.figure()
fig.set_figheight(10)
fig.set_figwidth(15)

plt.subplot(3,3,1)
plt.scatter(movies_with_rating_from_studio.FrRating , movies_with_rating_from_studio.FrEntries/1e6 )
plt.title('How does rating relate to Fr entries?')
plt.xlabel('Rating')
plt.ylabel('French entries (in millions)')

plt.subplot(3,3,2)
plt.scatter(movies_with_rating_from_studio.USOpening , movies_with_rating_from_studio.FrEntries/1e6 )
plt.title('How does US opening relate to Fr entries?')
plt.xlabel('Us opening')
plt.ylabel('French entries (in millions)')

plt.subplot(3,3,3)
plt.scatter(movies_with_rating_from_studio.ReleaseDelta , movies_with_rating_from_studio.FrEntries/1e6 )
plt.title('How does Release Delta relate to Fr entries?')
plt.xlabel('Release Delta in days')
plt.ylabel('French entries (in millions)')

plt.subplot(3,3,4)
plt.scatter(movies_with_rating_from_studio.FrRating2 , movies_with_rating_from_studio.FrEntries/1e6 )
plt.title('How does rating^2 relate to Fr entries?')
plt.xlabel('Rating^2')
plt.ylabel('French entries (in millions)')

plt.subplot(3,3,5)
plt.scatter(movies_with_rating_from_studio.USOpening2 , movies_with_rating_from_studio.FrEntries/1e6 )
plt.title('How does US opening^2 relate to Fr entries?')
plt.xlabel('US opening^2')
plt.ylabel('French entries (in millions)')

plt.subplot(3,3,6)
plt.scatter(movies_with_rating_from_studio.ReleaseDelta2 , movies_with_rating_from_studio.FrEntries/1e6 )
plt.title('How does Release Delta^2 relate to Fr entries?')
plt.xlabel('Release Delta^2 in days')
plt.ylabel('French entries (in millions)')

plt.subplot(3,3,7)
plt.scatter(movies_with_rating_from_studio.FrRating , movies_with_rating_from_studio.Log10FrEntries )
plt.title('How does rating relate to log10 Fr entries?')
plt.xlabel('Rating')
plt.ylabel('Log10 French entries')

plt.subplot(3,3,8)
plt.scatter(movies_with_rating_from_studio.Log10USOpening , movies_with_rating_from_studio.Log10FrEntries )
plt.title('How does log10 US opening relate to log10 Fr entries?')
plt.xlabel('Log 10 US opening')
plt.ylabel('Log10 French entries')

plt.subplot(3,3,9)
plt.scatter(movies_with_rating_from_studio.ReleaseDelta , movies_with_rating_from_studio.Log10FrEntries )
plt.title('How does Release Delta relate to log10 Fr entries?')
plt.xlabel('Release Delta in days')
plt.ylabel('Log10 French entries')

plt.tight_layout()


We can see some linearity between the log base 10 of French entries and the French rating, the delta in release dates, and the log base 10 of the US opening.

How about the relation between other parameters?

fig = plt.figure()
fig.set_figheight(5)
fig.set_figwidth(10)

plt.subplot(1,2,1)
plt.scatter(movies_with_rating_from_studio.FrRating , movies_with_rating_from_studio.Log10USOpening )
plt.title('How does rating relate to US opening?')
plt.xlabel('French rating')
plt.ylabel('Log10 US opening')

plt.subplot(1,2,2)
plt.scatter(movies_with_rating_from_studio.FrEntries/1e6 , movies_with_rating_from_studio.FrenchGross/1e6 )
plt.title('How do French entries relate to French gross?')
plt.xlabel('French entries (in millions)')
plt.ylabel('French gross (in millions)')

plt.tight_layout()


Which movies have a US opening outside the main cluster? We may need to analyze them further by getting additional data from Mojo…

select = movies_with_rating_from_studio[movies_with_rating_from_studio.Log10USOpening < 6.5]
print select[['MovieName','USOpening','USOpeningTheaters','Studio','ReleaseDelta','FrRating','FrEntries']]
                       MovieName  USOpening  USOpeningTheaters Studio  \
181          Inside Llewyn Davis     405411                  4    CBS   
188            Only God Forgives     313958                 78   RTWC   
216                       Trance     131145                  4   FoxS   
354                 Bachelorette     181494                 47   RTWC   
373  Beasts of the Southern Wild     169702                  4   FoxS   
386                  Ruby Sparks     140822                 13   FoxS   
394             Lay the Favorite      20998                 61   RTWC   
420                   Black Swan    1443809                 18   FoxS   
487                        Shame     349519                 10   FoxS   

     ReleaseDelta  FrRating  FrEntries  
181           -30       4.5     541937  
188           -58       3.4     432347  
216            33       2.9     211508  
354            40       2.2     154345  
373           168       4.1     288085  
386            70       2.7      27469  
394          -121       1.7      27169  
420            68       4.1    2617032  
487             5       4.0     435996  

Let’s improve our model.

X = movies_with_rating_from_studio[['FrRating','USOpening','ReleaseDelta','Ones']]
Y = movies_with_rating_from_studio['FrEntries']
linmodel = sm.OLS(Y,X).fit()
print linmodel.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              FrEntries   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.646
Method:                 Least Squares   F-statistic:                     87.50
Date:                Sun, 01 Feb 2015   Prob (F-statistic):           7.38e-32
Time:                        21:20:52   Log-Likelihood:                -2139.0
No. Observations:                 143   AIC:                             4286.
Df Residuals:                     139   BIC:                             4298.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
FrRating      3.402e+05   8.88e+04      3.832      0.000      1.65e+05  5.16e+05
USOpening        0.0322      0.002     13.029      0.000         0.027     0.037
ReleaseDelta -4997.6962   1786.774     -2.797      0.006     -8530.466 -1464.927
Ones         -6.423e+05    2.6e+05     -2.475      0.015     -1.16e+06 -1.29e+05
==============================================================================
Omnibus:                       36.008   Durbin-Watson:                   1.597
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               71.069
Skew:                           1.113   Prob(JB):                     3.69e-16
Kurtosis:                       5.640   Cond. No.                     1.73e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.73e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
plt.scatter(movies_with_rating_from_studio.FrRating, movies_with_rating_from_studio.FrEntries/1e6)

plt.title('How does rating relate to entries in France?')
plt.xlabel('Rating')
plt.ylabel('French Entries (in millions)')

plt.plot(movies_with_rating_from_studio.FrRating, linmodel.predict(X)/1e6, 'r^')

Using log base 10

X = movies_with_rating_from_studio[['FrRating','Log10USOpening','ReleaseDelta','Ones']]
Y = movies_with_rating_from_studio['Log10FrEntries']
linmodel = sm.OLS(Y,X).fit()
print linmodel.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Log10FrEntries   R-squared:                       0.536
Model:                            OLS   Adj. R-squared:                  0.526
Method:                 Least Squares   F-statistic:                     53.58
Date:                Sun, 01 Feb 2015   Prob (F-statistic):           4.46e-23
Time:                        21:20:54   Log-Likelihood:                -50.292
No. Observations:                 143   AIC:                             108.6
Df Residuals:                     139   BIC:                             120.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
FrRating           0.2277      0.039      5.836      0.000         0.151     0.305
Log10USOpening     0.4490      0.049      9.120      0.000         0.352     0.546
ReleaseDelta      -0.0037      0.001     -4.613      0.000        -0.005    -0.002
Ones               2.0031      0.369      5.425      0.000         1.273     2.733
==============================================================================
Omnibus:                        5.769   Durbin-Watson:                   1.544
Prob(Omnibus):                  0.056   Jarque-Bera (JB):                5.475
Skew:                          -0.376   Prob(JB):                       0.0647
Kurtosis:                       3.593   Cond. No.                         572.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
plt.scatter(movies_with_rating_from_studio.FrRating, movies_with_rating_from_studio.Log10FrEntries)

plt.title('How does rating relate to entries in France?')
plt.xlabel('Rating')
plt.ylabel('Log 10 French Entries')

plt.plot(movies_with_rating_from_studio.FrRating, linmodel.predict(X), 'r^')

The model seems OK. Let’s challenge it…

from sklearn.linear_model import LinearRegression

m = movies_with_rating_from_studio

X = m[['FrRating','Log10USOpening','ReleaseDelta','Ones']]
Y = m['Log10FrEntries']

sk_linmodel = LinearRegression()
sk_linmodel.fit(X,Y)
Ymodel=sk_linmodel.predict(X)

print X.head(2)
   FrRating  Log10USOpening  ReleaseDelta  Ones
1       3.6        7.861005            19     1
2       3.8        7.958199            -2     1
noise = Y-Ymodel
noise_mean = noise.mean()
noise_stdev = noise.std()
beta1 = sk_linmodel.coef_[0]
beta2 = sk_linmodel.coef_[1]
beta3 = sk_linmodel.coef_[2]
beta0 = sk_linmodel.intercept_

print 'Noise mean = %.3f' % noise_mean
print 'Noise std = %.3f' % noise_stdev
print 'B1 = %.3f, B2 = %.4f, B3 = %.5f, B0 = %.3f' % (beta1, beta2, beta3, beta0)
Noise mean = 0.000
Noise std = 0.345
B1 = 0.228, B2 = 0.4490, B3 = -0.00369, B0 = 2.003

Q-Q plot

measurements = np.random.normal(loc = noise_mean, scale = noise_stdev, size=length)
plt.scatter(sorted(noise), sorted(measurements), label='Noise versus Normal')
plt.legend(loc='upper left')

plt.scatter(noise, Ymodel, label='Noise versus Ymodel')
plt.legend(loc='upper left')

Ridge

from sklearn.linear_model import Ridge
Rmodel = Ridge(1.0)
Rmodel.fit(X,Y)
Yridge=Rmodel.predict(X)

Rnoise = Y-Yridge
Rnoise_mean = Rnoise.mean()
Rnoise_stdev = Rnoise.std()
Rbeta1 = Rmodel.coef_[0]
Rbeta2 = Rmodel.coef_[1]
Rbeta3 = Rmodel.coef_[2]
Rbeta0 = Rmodel.intercept_

print 'Ridge Noise mean = %.3f' % Rnoise_mean
print 'Ridge Noise std = %.3f' % Rnoise_stdev
print 'Ridge B1 = %.3f, B2 = %.4f, B3 = %.5f, B0 = %.3f' % (Rbeta1, Rbeta2, Rbeta3, Rbeta0)

print 'Noise mean = %.3f' % noise_mean
print 'Noise std = %.3f' % noise_stdev
print 'B1 = %.3f, B2 = %.4f, B3 = %.5f, B0 = %.3f' % (beta1, beta2, beta3, beta0)
Ridge Noise mean = 0.000
Ridge Noise std = 0.345
Ridge B1 = 0.225, B2 = 0.4405, B3 = -0.00370, B0 = 2.072
Noise mean = 0.000
Noise std = 0.345
B1 = 0.228, B2 = 0.4490, B3 = -0.00369, B0 = 2.003
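For intuition: with a single centered feature, ridge simply shrinks the OLS slope by adding alpha to the denominator, while the intercept stays unpenalized (as in sklearn’s Ridge). A toy sketch, not run on the movie data:

```python
def ridge_1d(x, y, alpha=1.0):
    """One-feature ridge regression; the L2 penalty shrinks the slope
    toward zero, the intercept is left unpenalized."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / (sxx + alpha)
    return b1, my - b1 * mx

# With alpha=0 this matches plain OLS; larger alpha gives a smaller slope,
# mirroring the slightly shrunk B1/B2 in the ridge output above.
```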

How do the betas change with small variations of the data?

sk_linmodel = LinearRegression()
sk_linmodel.fit(X[5:],Y[5:])
Ymodel=sk_linmodel.predict(X[5:])
beta1 = sk_linmodel.coef_[0]
beta2 = sk_linmodel.coef_[1]
beta3 = sk_linmodel.coef_[2]
beta0 = sk_linmodel.intercept_

print 'B1 = %.3f, B2 = %.4f, B3 = %.5f, B0 = %.3f' % (beta1, beta2, beta3, beta0)

sk_linmodel.fit(X[:-5],Y[:-5])
Ymodel=sk_linmodel.predict(X[:-5])
beta1 = sk_linmodel.coef_[0]
beta2 = sk_linmodel.coef_[1]
beta3 = sk_linmodel.coef_[2]
beta0 = sk_linmodel.intercept_

print 'B1 = %.3f, B2 = %.4f, B3 = %.5f, B0 = %.3f' % (beta1, beta2, beta3, beta0)
B1 = 0.214, B2 = 0.4390, B3 = -0.00367, B0 = 2.103
B1 = 0.236, B2 = 0.4376, B3 = -0.00351, B0 = 2.068

Fit a model to a training set. Calculate the mean squared error on the training set, then on the test set.

mtraining = m[:length*3/4]
mtest = m[length*3/4:]

Xtraining = mtraining[['FrRating','Log10USOpening','ReleaseDelta','Ones']]
Ytraining = mtraining['Log10FrEntries']

sk_linmodel_training = LinearRegression()
sk_linmodel_training.fit(Xtraining,Ytraining)

from sklearn.metrics import mean_squared_error
print 'Training error: %.2f' % mean_squared_error(Ytraining,sk_linmodel_training.predict(Xtraining))
Training error: 0.13
Xtest = mtest[['FrRating','Log10USOpening','ReleaseDelta','Ones']]
Ytest = mtest['Log10FrEntries']

print 'Test error: %.2f' % mean_squared_error(Ytest,sk_linmodel_training.predict(Xtest))
Test error: 0.09

Try polynomial fits from 0 to 7th order on French Rating

m = movies_with_rating_from_studio[['Log10FrEntries','Log10USOpening','ReleaseDelta','Ones','FrRating']]
# Add polynomial features X2..X7 on FrRating
for i in range(6):
    m['X'+str(i+2)]=[pow(x, i+2) for x in m['FrRating']]

mtraining = m[:length*3/4]
mtest = m[length*3/4:]

Ytraining = mtraining['Log10FrEntries']
Ytest = mtest['Log10FrEntries']

training_error=[]
test_error=[]
r_squared=[]
aic=[]

for i in range(8):
    Xtraining = mtraining.iloc[:,1:4+i]
    sk_linmodel_training = LinearRegression()
    sk_linmodel_training.fit(Xtraining,Ytraining)
    Xtest = mtest.iloc[:,1:4+i]

    training_error.append(mean_squared_error(Ytraining,sk_linmodel_training.predict(Xtraining)))
    test_error.append(mean_squared_error(Ytest,sk_linmodel_training.predict(Xtest)))

    r_squared.append(sk_linmodel_training.score(Xtraining,Ytraining))

    # Extract the AIC from the statsmodels summary, as a float so it can be plotted
    linmodel = sm.OLS(Ytraining, Xtraining).fit()
    res = str(linmodel.summary())
    idx = res.split().index('AIC:')
    aic.append(float(res.split()[idx+1]))
labels = ['-', 'X', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7']

fig = plt.figure()
fig.set_figheight(6)
fig.set_figwidth(8)
plt.xlabel('complexity')
plt.ylabel('value')
plt.subplot(2,2,1)
plt.xticks(range(8), labels)
plt.plot(range(8), training_error, 'g-', label='Training Error')
plt.legend(loc='upper right')
plt.subplot(2,2,2)
plt.xticks(range(8), labels)
plt.plot(range(8), test_error, 'm-', label='Test Error')
plt.legend(loc='upper left')
plt.subplot(2,2,3)
plt.xticks(range(8), labels)
plt.plot(range(8), r_squared, 'b-', label='R squared')
plt.legend(loc='upper left')
plt.subplot(2,2,4)
plt.xticks(range(8), labels)
plt.plot(range(8), aic, 'r-', label='AIC')
plt.legend(loc='upper left')
plt.tight_layout()

[Figure: training error, test error, R squared and AIC vs. polynomial order on FrRating]

This does not look meaningful above first order.
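That eyeball judgment can be turned into code by simply picking the order with the lowest test error. A minimal sketch, using illustrative error values rather than the ones computed in this run:

```python
# Hypothetical helper: return the index (polynomial order) of the lowest score.
def best_order(scores):
    return min(range(len(scores)), key=lambda k: scores[k])

# Illustrative test errors for orders 0..7, not the actual values above
illustrative_test_error = [0.14, 0.11, 0.12, 0.12, 0.13, 0.15, 0.18, 0.22]
best = best_order(illustrative_test_error)
```

The same helper applied to the AIC list would give an information-criterion-based pick.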

Try polynomial fits from 0 to 7th order on Log 10 US Opening

m = movies_with_rating_from_studio[['Log10FrEntries','FrRating','ReleaseDelta','Ones','Log10USOpening']]
# Add polynomial features X2..X7 on Log10USOpening
for i in range(6):
    m['X'+str(i+2)]=[pow(x, i+2) for x in m['Log10USOpening']]

mtraining = m[:length*3/4]
mtest = m[length*3/4:]

Ytraining = mtraining['Log10FrEntries']
Ytest = mtest['Log10FrEntries']

training_error=[]
test_error=[]
r_squared=[]
aic=[]

for i in range(8):
    Xtraining = mtraining.iloc[:,1:4+i]
    sk_linmodel_training = LinearRegression()
    sk_linmodel_training.fit(Xtraining,Ytraining)
    Xtest = mtest.iloc[:,1:4+i]

    training_error.append(mean_squared_error(Ytraining,sk_linmodel_training.predict(Xtraining)))
    test_error.append(mean_squared_error(Ytest,sk_linmodel_training.predict(Xtest)))

    r_squared.append(sk_linmodel_training.score(Xtraining,Ytraining))

    # Extract the AIC from the statsmodels summary, as a float so it can be plotted
    linmodel = sm.OLS(Ytraining, Xtraining).fit()
    res = str(linmodel.summary())
    idx = res.split().index('AIC:')
    aic.append(float(res.split()[idx+1]))

fig = plt.figure()
fig.set_figheight(6)
fig.set_figwidth(8)
labels = ['-', 'X', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7']
plt.xlabel('complexity')
plt.ylabel('value')
plt.subplot(2,2,1)
plt.xticks(range(8), labels)
plt.plot(range(8), training_error, 'g-', label='Training Error')
plt.legend(loc='upper right')
plt.subplot(2,2,2)
plt.xticks(range(8), labels)
plt.plot(range(8), test_error, 'm-', label='Test Error')
plt.legend(loc='upper left')
plt.subplot(2,2,3)
plt.xticks(range(8), labels)
plt.plot(range(8), r_squared, 'b-', label='R squared')
plt.legend(loc='upper left')
plt.subplot(2,2,4)
plt.xticks(range(8), labels)
plt.plot(range(8), aic, 'r-', label='AIC')
plt.legend(loc='upper left')
plt.tight_layout()

[Figure: training error, test error, R squared and AIC vs. polynomial order on Log10USOpening]

The 4th order looks good, but checking the t-values shows that this test fails!
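For reference, the t-value reported by statsmodels is just the coefficient divided by its standard error. A minimal numpy sketch on synthetic data (not the movie data) shows the computation behind the summary table, with an informative feature and a pure-noise feature:

```python
import numpy as np

def ols_t_values(X, y):
    # Classic OLS: beta = (X'X)^-1 X'y, then t = beta / se(beta),
    # with standard errors taken from the residual variance.
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T.dot(X))
    beta = XtX_inv.dot(X.T).dot(y)
    resid = y - X.dot(beta)
    sigma2 = resid.dot(resid) / (n - p)        # unbiased residual variance
    se = np.sqrt(np.diag(sigma2 * XtX_inv))    # standard errors of the betas
    return beta, beta / se

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(200), rng.randn(200), rng.randn(200)])
y = 2.0 + 1.5 * X[:, 1] + rng.randn(200)       # the last column is pure noise
beta, t = ols_t_values(X, y)
```

The informative coefficient gets a large |t|, while the noise coefficient typically fails the usual |t| > 2 rule of thumb, which is exactly the pattern seen in the X2..X4 rows of the summary below.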

linmodel = sm.OLS(Ytraining, mtraining.iloc[:,1:8]).fit()
print linmodel.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Log10FrEntries   R-squared:                       0.571
Model:                            OLS   Adj. R-squared:                  0.545
Method:                 Least Squares   F-statistic:                     22.19
Date:                Sun, 01 Feb 2015   Prob (F-statistic):           1.86e-16
Time:                        21:21:07   Log-Likelihood:                -35.325
No. Observations:                 107   AIC:                             84.65
Df Residuals:                     100   BIC:                             103.4
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
FrRating           0.1613      0.052      3.126      0.002         0.059     0.264
ReleaseDelta      -0.0037      0.001     -3.667      0.000        -0.006    -0.002
Ones            -121.8224     66.970     -1.819      0.072      -254.688    11.043
Log10USOpening    77.0760     43.900      1.756      0.082       -10.021   164.173
X2               -17.3381     10.644     -1.629      0.106       -38.455     3.779
X3                 1.7062      1.132      1.507      0.135        -0.540     3.952
X4                -0.0617      0.045     -1.384      0.169        -0.150     0.027
==============================================================================
Omnibus:                        4.278   Durbin-Watson:                   1.378
Prob(Omnibus):                  0.118   Jarque-Bera (JB):                3.820
Skew:                          -0.325   Prob(JB):                        0.148
Kurtosis:                       3.658   Cond. No.                     7.27e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.27e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
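The large condition number comes from taking raw powers of a feature whose values sit far from zero, which makes the polynomial columns nearly collinear. A common remedy, sketched here on synthetic values rather than what the post actually does, is to standardize the feature before building the powers:

```python
import numpy as np

x = np.linspace(5, 9, 100)                 # roughly the scale of Log10USOpening
X_raw = np.column_stack([x**k for k in range(1, 5)])   # x, x^2, x^3, x^4

xc = (x - x.mean()) / x.std()              # center and scale first
X_std = np.column_stack([xc**k for k in range(1, 5)])

# The standardized design matrix is far better conditioned
cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)
```

The fitted curve is the same either way; only the numerical stability and the interpretability of the individual coefficients change.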

Let’s check now if our data set is sufficient to make a conclusion by plotting the learning curve.

Fit a model to only the first 4 data points (m=1), then to the first 8 (m=2), then to the first 12 (m=3), and so on until the entire training set is fitted. For each step, calculate the training error and the test error, then plot both over m.

m = movies_with_rating_from_studio[['Log10FrEntries','FrRating','Log10USOpening','ReleaseDelta','Ones']]


training_error=[]
test_error=[]

for i in range(1,29):
    Ytraining = m.iloc[0:i*4,0:1]
    Ytest = m.iloc[i*4:i*5,0:1]
    Xtraining = m.iloc[0:i*4,1:5]
    sk_linmodel_training = LinearRegression()
    sk_linmodel_training.fit(Xtraining,Ytraining)
    Xtest = m.iloc[i*4:i*5,1:5]

    training_error.append(mean_squared_error(Ytraining,sk_linmodel_training.predict(Xtraining)))
    test_error.append(mean_squared_error(Ytest,sk_linmodel_training.predict(Xtest)))


labels = range(1,29)
fig = plt.figure()
fig.set_figheight(10)
fig.set_figwidth(14)
plt.xlabel('m (training set size)')
plt.ylabel('error')
plt.subplot(2,2,1)
plt.xticks(range(28), labels)
plt.plot(range(28), training_error, 'g-', label='Training Error')
plt.plot(range(28), test_error, 'm-', label='Test Error')
plt.legend(loc='upper right')

plt.tight_layout()

[Figure: training and test error vs. training set size m]

The errors seem to be converging…
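As a sanity check of that reading, here is a synthetic sketch (not the movie data): once the training set is large enough, both errors should settle near the irreducible noise variance, which is what a flattening learning curve means.

```python
import numpy as np

rng = np.random.RandomState(1)
X_all = np.column_stack([np.ones(400), rng.randn(400)])
y_all = 1.0 + 0.5 * X_all[:, 1] + 0.3 * rng.randn(400)   # noise variance = 0.09
X_test, y_test = X_all[300:], y_all[300:]                 # held-out test set

def mse(X, y, beta):
    r = y - X.dot(beta)
    return r.dot(r) / len(y)

errors = []
for n in (10, 50, 300):
    # Fit on the first n points only, then score on train and test
    beta = np.linalg.lstsq(X_all[:n], y_all[:n], rcond=None)[0]
    errors.append((mse(X_all[:n], y_all[:n], beta), mse(X_test, y_test, beta)))
```

With n = 300 both errors sit close to the noise floor of 0.09, mirroring the convergence in the plot above.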

Now let’s get some insights from the model!

import random

# Invert the log10-linear model to predict entries from rating, US opening and release delta
def underlying_gross_model(FrRating, Log10USOpening, ReleaseDelta, beta0, beta1, beta2, beta3, noise_mean, noise_stdev):
    log10rev = beta1 * FrRating + beta2 * Log10USOpening + beta3 * ReleaseDelta + beta0 + random.gauss(noise_mean, noise_stdev)
    return pow(10,log10rev)
mean_rating = float(movies_with_rating_from_studio[['FrRating']].mean())
print 'Mean French Rating: %0.2f' % mean_rating
mean_usopening = float(movies_with_rating_from_studio[['USOpening']].mean())
print 'Mean US opening: %0.2f' % mean_usopening
mean_releasedelta = float(movies_with_rating_from_studio[['ReleaseDelta']].mean())
print 'Mean Delta Release date: %0.2f' % mean_releasedelta
Mean French Rating: 2.80
Mean US opening: 30334400.05
Mean Delta Release date: 25.03
mean_entries = underlying_gross_model(mean_rating, log10(mean_usopening), mean_releasedelta,
                                      beta0, beta1, beta2, beta3, 0, 0)
print 'Mean entries: %i' % mean_entries
Mean entries: 820035
mean_entries_minus1star = underlying_gross_model(mean_rating-0.05, log10(mean_usopening), mean_releasedelta,
                                                beta0, beta1, beta2, beta3, 0, 0)
print 'Cost of one star from 1 journalist / 20: %i entries' % (mean_entries-mean_entries_minus1star)
Cost of one star from 1 journalist / 20: 21948 entries
mean_entries_minus20stars = underlying_gross_model(mean_rating-1, log10(mean_usopening), mean_releasedelta,
                                                beta0, beta1, beta2, beta3, 0, 0)
print 'Cost of one star from all journalists: %i entries' % (mean_entries-mean_entries_minus20stars)
Cost of one star from all journalists: 343398 entries
mean_entries_plus1day = underlying_gross_model(mean_rating, log10(mean_usopening), mean_releasedelta+1,
                                                beta0, beta1, beta2, beta3, 0, 0)
print 'Cost of one day of delay to release: %i entries' % (mean_entries-mean_entries_plus1day)
Cost of one day of delay to release: 6596 entries
mean_entries_plus1week = underlying_gross_model(mean_rating, log10(mean_usopening), mean_releasedelta+7,
                                                beta0, beta1, beta2, beta3, 0, 0)
print 'Cost of one week of delay to release: %i entries' % (mean_entries-mean_entries_plus1week)
Cost of one week of delay to release: 45074 entries
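Because the model is linear in log10 entries, each coefficient also has a clean multiplicative reading: a change of Δ in a predictor multiplies entries by 10^(beta·Δ). A small sketch cross-checking the one-star cost this way, using the B1 ≈ 0.228 printed earlier (the exact figures above used whichever fitted betas were in memory at that point, so the numbers differ slightly):

```python
beta1 = 0.228                     # fitted coefficient on FrRating, printed earlier
mean_entries = 820035             # model prediction at the mean inputs, from above

# Every journalist dropping one star shifts FrRating by -1,
# which multiplies predicted entries by 10**(-beta1)
one_star_factor = 10 ** (-beta1)
lost_entries = mean_entries * (1 - one_star_factor)
```

The multiplicative view makes it easy to see that the cost of a bad rating scales with the size of the opening: a blockbuster loses far more entries per star than a small release.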