Netflix! What started in 1997 as a DVD rental service has since exploded into one of the largest entertainment and media companies.

Given the large number of movies and series available on the platform, it is a perfect opportunity to flex your exploratory data analysis skills and dive into the entertainment industry. Your friend has also been brushing up on their Python skills and has taken a first crack at a CSV file containing Netflix data. They believe that the average duration of movies has been declining. Using your friend's initial research, you'll delve into the Netflix data to see if you can determine whether movie lengths are actually getting shorter and explain some of the contributing factors, if any.

You have been supplied with the dataset netflix_data.csv, along with the following table detailing the column names and descriptions:

The data

netflix_data.csv

Column        Description
show_id       The ID of the show
type          Type of show
title         Title of the show
director      Director of the show
cast          Cast of the show
country       Country of origin
date_added    Date added to Netflix
release_year  Year of Netflix release
duration      Duration of the show in minutes
description   Description of the show
genre         Show genre

Importing libraries

In [ ]:
# Importing pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt

Loading the CSV file into the netflix_df DataFrame and examining the data

In [ ]:
# Loading data from .csv file
netflix_df = pd.read_csv('./netflix_data.csv')
In [ ]:
# Examining the head
netflix_df.head()
Out[ ]:
show_id type title director cast country date_added release_year duration description genre
0 s1 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R... Brazil August 14, 2020 2020 4 In a future where the elite inhabit an island ... International TV
1 s2 Movie 7:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico December 23, 2016 2016 93 After a devastating earthquake hits Mexico Cit... Dramas
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence ... Singapore December 20, 2018 2011 78 When an army recruit is found dead, his fellow... Horror Movies
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly... United States November 16, 2017 2009 80 In a postapocalyptic world, rag-doll robots hi... Action
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... United States January 1, 2020 2008 123 A brilliant group of students become card-coun... Dramas
In [ ]:
# Examining the tail
netflix_df.tail()
Out[ ]:
show_id type title director cast country date_added release_year duration description genre
7782 s7783 Movie Zozo Josef Fares Imad Creidi, Antoinette Turk, Elias Gergi, Car... Sweden October 19, 2020 2005 99 When Lebanon's Civil War deprives Zozo of his ... Dramas
7783 s7784 Movie Zubaan Mozez Singh Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... India March 2, 2019 2015 111 A scrappy but poor boy worms his way into a ty... Dramas
7784 s7785 Movie Zulu Man in Japan NaN Nasty C NaN September 25, 2020 2019 44 In this documentary, South African rapper Nast... Documentaries
7785 s7786 TV Show Zumbo's Just Desserts NaN Adriano Zumbo, Rachel Khoo Australia October 31, 2020 2019 1 Dessert wizard Adriano Zumbo looks for the nex... International TV
7786 s7787 Movie ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS Sam Dunn NaN United Kingdom March 1, 2020 2019 90 This documentary delves into the mystique behi... Documentaries
In [ ]:
# Data info
netflix_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   duration      7787 non-null   int64 
 9   description   7787 non-null   object
 10  genre         7787 non-null   object
dtypes: int64(2), object(9)
memory usage: 669.3+ KB
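The `info()` output above shows that `director`, `cast`, `country` and `date_added` have fewer non-null entries than the 7787 rows in the DataFrame. One way to see those gaps directly is to count missing values per column. The sketch below is a minimal illustration on a small, made-up DataFrame (`sample_df` and its values are hypothetical stand-ins, not rows from netflix_data.csv); the same two calls should work on `netflix_df`.

```python
import pandas as pd

# Hypothetical stand-in for netflix_df; the values are illustrative only
sample_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["Brazil", "Mexico", None, "India"],
})

# Count missing values per column, largest count first
missing_counts = sample_df.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

Since the analysis here only uses `title`, `country`, `genre`, `release_year` and `duration`, the missing directors and cast lists should not affect the duration plot.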
In [ ]:
# Check which values occur in the 'type' column
netflix_df.type.unique()
Out[ ]:
array(['TV Show', 'Movie'], dtype=object)

Subsetting the Movies

In [ ]:
# Subset the movies and store them as netflix_subset
netflix_subset = netflix_df[netflix_df['type'] == 'Movie']
In [ ]:
# Subset info => 2410 entries removed
netflix_subset.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5377 entries, 1 to 7786
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       5377 non-null   object
 1   type          5377 non-null   object
 2   title         5377 non-null   object
 3   director      5214 non-null   object
 4   cast          4951 non-null   object
 5   country       5147 non-null   object
 6   date_added    5377 non-null   object
 7   release_year  5377 non-null   int64 
 8   duration      5377 non-null   int64 
 9   description   5377 non-null   object
 10  genre         5377 non-null   object
dtypes: int64(2), object(9)
memory usage: 504.1+ KB

Starting analysis

In [ ]:
# Keep only the columns needed for the analysis
netflix_movies = netflix_subset[['title', 'country', 'genre', 'release_year', 'duration']]
netflix_movies.head()
Out[ ]:
title country genre release_year duration
1 7:19 Mexico Dramas 2016 93
2 23:59 Singapore Horror Movies 2011 78
3 9 United States Action 2009 80
4 21 United States Dramas 2008 123
6 122 Egypt Horror Movies 2019 95
In [ ]:
# Filter for durations shorter than 60 minutes
short_movies = netflix_movies[netflix_movies.duration < 60]
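The filter above selects movies shorter than 60 minutes, and a natural follow-up is to look at which genres those short titles belong to. The sketch below shows one possible way to do that with `value_counts()`. It runs on a small, made-up DataFrame (`sample_movies` is a hypothetical stand-in for `netflix_movies`), so the counts it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for netflix_movies; the values are illustrative only
sample_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genre": ["Documentaries", "Children", "Dramas", "Documentaries", "Stand-Up"],
    "duration": [44, 30, 120, 55, 95],
})

# Same filter as in the notebook: durations shorter than 60 minutes
short_sample = sample_movies[sample_movies["duration"] < 60]

# Count how many short titles fall into each genre
genre_counts = short_sample["genre"].value_counts()
print(genre_counts)
```

Applied to `short_movies`, a breakdown like this could help decide which genres to highlight in the scatter plot.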
In [ ]:
# Create an empty list
colors = []
# Iterate over rows of netflix_movies and pick a color for each genre
for _, row in netflix_movies.iterrows():
    if row["genre"] == "Children":
        colors.append("red")
    elif row["genre"] == "Documentaries":
        colors.append("blue")
    elif row["genre"] == "Stand-Up":
        colors.append("green")
    else:
        colors.append("purple")

# Inspect the first 10 values in the list
colors[:10]
Out[ ]:
['purple',
 'purple',
 'purple',
 'purple',
 'purple',
 'purple',
 'purple',
 'purple',
 'purple',
 'blue']
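The loop above builds the color list one row at a time. As an alternative sketch, the same genre-to-color rule can be expressed with a dictionary and `Series.map`, which avoids `iterrows()`. The names `sample_movies` and `genre_colors` are hypothetical, and the sample genres are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for netflix_movies; the values are illustrative only
sample_movies = pd.DataFrame({
    "genre": ["Dramas", "Children", "Documentaries", "Stand-Up", "Action"],
})

# Genres that get their own color; every other genre falls back to purple
genre_colors = {"Children": "red", "Documentaries": "blue", "Stand-Up": "green"}

# map() returns NaN for genres missing from the dictionary, fillna() supplies the default
colors = sample_movies["genre"].map(genre_colors).fillna("purple").tolist()
print(colors)
```

The result is a plain list, so it can be passed to the `c` argument of `plt.scatter` in the same way as the list built by the loop.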

Plotting time!

In [ ]:
# Initialize a new figure and set its size
fig = plt.figure(figsize=(12,8))
plt.scatter(x = netflix_movies['release_year'],
            y = netflix_movies['duration'],
            c = colors,
            alpha = 0.8)
plt.title('Movie Duration by Year of Release')
plt.xlabel('Release year')
plt.ylabel('Duration (min)')
plt.show()
plt.clf()
[Figure: scatter plot of movie duration (min) against release year, with points colored by genre]

The scatter plot above shows no clear relationship between a movie's release year and its duration.
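A visual reading of a scatter plot can be backed up with numbers. One possible check, sketched below, is to compute the mean duration per release year and the Pearson correlation between the two columns. The DataFrame `sample_movies` is a hypothetical stand-in with made-up values, so its downward trend is an artifact of the toy data and not a finding about Netflix.

```python
import pandas as pd

# Hypothetical stand-in for netflix_movies; the values are illustrative only
sample_movies = pd.DataFrame({
    "release_year": [2010, 2010, 2015, 2015, 2020, 2020],
    "duration": [100, 110, 95, 105, 90, 100],
})

# Average duration of the movies released in each year
mean_by_year = sample_movies.groupby("release_year")["duration"].mean()
print(mean_by_year)

# Pearson correlation between release year and duration (ranges from -1 to 1)
corr = sample_movies["release_year"].corr(sample_movies["duration"])
print(f"Correlation between release year and duration: {corr:.2f}")
```

Run on `netflix_movies`, a correlation close to zero would be consistent with the reading of the scatter plot, while a clearly negative value would support the friend's hypothesis.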

Are we certain that movies are getting shorter?

In [ ]:
answer = 'no'