
Python Project: YouTube Analysis Part 2: A Deeper Look into the Channel

  • Writer: Sonu Kothari
  • Nov 25, 2023
  • 6 min read

Updated: Dec 22, 2023


We will start by importing the libraries we need. If any of them are not installed yet, you can install them with "pip install".
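For example, installing everything in one go would look something like this (package names assumed from the imports below):

pip install google-api-python-client pandas seaborn matplotlib nltk wordcloud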

from googleapiclient.discovery import build
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from wordcloud import WordCloud
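
A quick reminder: the youtube API client object used throughout this post was created in part 1. If you are following along from scratch, building it looks roughly like this (api_key stands for your own API key):

youtube = build('youtube', 'v3', developerKey=api_key)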

As we already gathered the channel-level information in part 1 of this project, we will now dig into the details of one of my favorite channels. Let's start by extracting the playlist ID.

playlist_id = channel_data.loc[channel_data['Channel_name']=='How to Power BI','playlist_id'].iloc[0]

playlist_id

Now that we have our playlist ID, we will extract the video IDs of all the videos in the "How to Power BI" channel.

def get_video_ids(youtube, playlist_id):
    # First page of playlist items (the API returns at most 50 per request)
    request = youtube.playlistItems().list(
        part='contentDetails',
        playlistId=playlist_id,
        maxResults=50)
    response = request.execute()

    video_ids = []
    for i in range(len(response['items'])):
        video_ids.append(response['items'][i]['contentDetails']['videoId'])

    # Keep requesting the next page until there is none left
    next_page_token = response.get('nextPageToken')
    more_pages = True

    while more_pages:
        if next_page_token is None:
            more_pages = False
        else:
            request = youtube.playlistItems().list(
                part='contentDetails',
                playlistId=playlist_id,
                maxResults=50,
                pageToken=next_page_token)
            response = request.execute()

            for i in range(len(response['items'])):
                video_ids.append(response['items'][i]['contentDetails']['videoId'])

            next_page_token = response.get('nextPageToken')

    return video_ids

video_ids = get_video_ids(youtube, playlist_id)
video_ids

Okay, so we got our video IDs; now we need to extract the video details. We will pull the title, published date, views, likes, favorite count, and comments for each video ID, collecting the information for each video in a dictionary.

def get_video_details(youtube, video_ids):
    all_video_stats = []

    # The API accepts at most 50 video IDs per call, so process them in batches
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
                        part='snippet,statistics',
                        id=','.join(video_ids[i:i+50]))
        response = request.execute()

        for video in response['items']:
            video_stats = dict(Title = video['snippet']['title'],
                               Published_date = video['snippet']['publishedAt'],
                               Views = video['statistics']['viewCount'],
                               Likes = video['statistics']['likeCount'],
                               favoriteCount = video['statistics']['favoriteCount'],
                               Comments = video['statistics']['commentCount'])

            all_video_stats.append(video_stats)

    return all_video_stats

video_details = get_video_details(youtube, video_ids)
video_details

#Turning video_details into dataframe
video_data = pd.DataFrame(video_details)
video_data


Let's make sure that each column of our dataframe 'video_data' has the correct data type.

video_data['Published_date'] = pd.to_datetime(video_data['Published_date']).dt.date

video_data['Views'] = pd.to_numeric(video_data['Views'])

video_data['Likes'] = pd.to_numeric(video_data['Likes'])

video_data['favoriteCount'] = pd.to_numeric(video_data['favoriteCount'])

video_data['Comments'] = pd.to_numeric(video_data['Comments'])
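
As a quick, optional sanity check, you can confirm the conversions worked:

video_data.dtypes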

As we can see, we have the published date column, and we can derive a lot of useful information from it. Let's create a month column using our 'Published_date' column.

video_data['Month'] = pd.to_datetime(video_data['Published_date']).dt.strftime('%b')

video_data

As we can see, we now have a new column holding the month, which will help us when analyzing and visualizing the data. Let's check how many videos were uploaded each month.


videos_per_month = video_data.groupby('Month', as_index=False).size()
videos_per_month


Hmm, the result does not come out in calendar order, and we need it month-wise. So let's give Python the sort order so it displays the result set the way we want. The code below creates a CategoricalIndex from the 'Month' column of the DataFrame videos_per_month. The order of the categories is determined by the sort_order list, and the resulting CategoricalIndex becomes the index of the DataFrame. This is useful for tasks such as sorting, grouping, and analyzing data based on the categorical nature of the months.


sort_order = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

videos_per_month.index = pd.CategoricalIndex(videos_per_month['Month'], categories=sort_order, ordered=True)

videos_per_month = videos_per_month.sort_index()

And voilà, we have our desired result set. This looks a bit plain, though; as a data analyst I want to make it look beautiful. Let's visualize this data and have some fun!

# pass the sort order so the bars appear in calendar order
ax2 = sns.barplot(x='Month', y='size', data=videos_per_month, order=sort_order)

Wow, this looks great. It looks like the YouTuber uploaded a large number of videos in October compared to other months, and February seems to be the month with the fewest uploads. Well, we have extracted only a few details for each video so far, but let's get some more information, like tags, comment count, caption, definition, duration, and title. The code below focuses on getting more information about each video.


def get_video_details(youtube, video_ids):

    all_video_info = []

    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
                        part='snippet,contentDetails,statistics',
                        id=','.join(video_ids[i:i+50]))
        response = request.execute()

        for video in response['items']:
            # the pieces of each API part that we want to keep
            stats_to_keep = {'snippet': ['channelTitle','title','tags','publishedAt'],
                             'statistics': ['viewCount','likeCount','favoriteCount','commentCount'],
                             'contentDetails': ['duration','definition','caption']
                             }
            video_info = {}

            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    try:
                        video_info[v] = video[k][v]
                    except KeyError:
                        # some videos lack a field (e.g. no tags), so store None
                        video_info[v] = None

            all_video_info.append(video_info)

    return pd.DataFrame(all_video_info)

#let's create the dataframe and check what we got here
video_df = get_video_details(youtube, video_ids)
video_df
   

Looks like we got all the information we needed, and we now have 12 columns. We will do some data pre-processing to make sure every column has the correct data type and follows the correct format. Let's start by checking whether any columns have null values.

video_df.isnull().any()

Checking the data types for each column

video_df.dtypes

We are going to change the data types of a few columns which need to be numeric.

numeric_cols = ['viewCount','likeCount','favoriteCount','commentCount']
video_df[numeric_cols] = video_df[numeric_cols].apply(pd.to_numeric, errors='coerce', axis=1)

Now I am thinking of extracting the weekday from our 'publishedAt' column. The lines below convert the 'publishedAt' column, which contains date and time information as strings, into datetime objects. Then a new column, 'publishDayName', stores the day of the week for each publication date. This can be useful for analysis based on the day of the week on which videos are published.


from dateutil import parser

#publish day in the week

video_df['publishedAt'] = video_df['publishedAt'].apply(lambda x: parser.parse(x))
video_df['publishDayName'] = video_df['publishedAt'].apply(lambda x: x.strftime("%A"))
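
For a quick feel of the result, counting uploads per weekday (just an optional check) is a one-liner:

video_df['publishDayName'].value_counts()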

Now, I don't really need the tags themselves; what I need from that column is the tag count, since that will help us understand how the number of tags affects a video.

#add tag count column

video_df['tagCount'] = video_df['tags'].apply(lambda x: 0 if x is None else len(x))
video_df

Great, so far we have 14 columns. Now we need to handle our duration column. The lines below parse the ISO 8601 duration strings in the 'duration' column using the isodate library and then convert the parsed duration objects to a total number of seconds. The resulting 'durationSecs' column contains the duration of each video in seconds, which is handy for numerical analysis and comparisons.

import isodate
video_df['durationSecs'] = video_df['duration'].apply(lambda x: isodate.parse_duration(x))
video_df['durationSecs'] = video_df['durationSecs'].dt.total_seconds()
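
To see what this conversion does, you can try it by hand on a single (made-up) ISO 8601 duration string:

isodate.parse_duration('PT12M34S').total_seconds()   # 12 min 34 s -> 754.0 seconds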

So, now we have tackled that too. Let's do some analysis with these columns. I want to check the view distribution for this channel.


# Increase figure size
plt.figure(figsize=(10, 6))

# Set color palette (you can choose any color palette you prefer)
sns.set_palette("pastel")

# Create the violin plot
ax = sns.violinplot(x=video_df['channelTitle'], y=video_df['viewCount'])

# Add labels and title
plt.xlabel('Channel Title', fontsize=12)
plt.ylabel('View Count', fontsize=12)
plt.title('View Count by Channel Title', fontsize=14)

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, horizontalalignment='right')

# Adjust font sizes
ax.tick_params(axis='both', labelsize=10)

# Add gridlines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Add some padding to the plot
plt.tight_layout()

# Show the plot
plt.show()

Great, this shows us view count distribution across this channel. Now, let's create a scatter plot to see if we can find any relationship between views vs likes and views vs comments.

fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Set color palette (you can choose any color palette you prefer)
sns.set_palette("pastel")

# Create the scatter plots
sns.scatterplot(data=video_df, x='commentCount', y='viewCount', ax=ax[0])
sns.scatterplot(data=video_df, x='likeCount', y='viewCount', ax=ax[1])

# Add labels and title for the first plot
ax[0].set_xlabel('Comment Count', fontsize=12)
ax[0].set_ylabel('View Count', fontsize=12)
ax[0].set_title('View Count vs. Comment Count', fontsize=14)

# Add labels and title for the second plot
ax[1].set_xlabel('Like Count', fontsize=12)
ax[1].set_ylabel('View Count', fontsize=12)
ax[1].set_title('View Count vs. Like Count', fontsize=14)

# Adjust font sizes
for a in ax:
    a.tick_params(axis='both', labelsize=10)

# Add gridlines for better readability
for a in ax:
    a.grid(linestyle='--', alpha=0.7)

# Add some padding to the plots
plt.tight_layout()

# Show the plots
plt.show()
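
If you want to put rough numbers on these relationships, a quick correlation matrix (an optional check, not part of the plots above) does the trick:

video_df[['viewCount', 'likeCount', 'commentCount']].corr()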

Great, we see a positive relationship between these factors, which makes sense: more views generally lead to more likes and comments. Now let's create a word cloud of our video titles. I love this visual; it makes it so easy to see which words appear most often in the titles. The code below creates the word cloud.

stop_words = set(stopwords.words('english'))
video_df['title_no_stopwords'] = video_df['title'].apply(lambda x: [item for item in str(x).split() if item not in stop_words])

all_words = list([a for b in video_df['title_no_stopwords'].tolist() for a in b])
all_words_str = ' '.join(all_words) 


def plot_cloud(wordcloud):
    plt.figure(figsize=(30, 20))
    plt.imshow(wordcloud) 
    plt.axis("off");

wordcloud = WordCloud(width = 2000, height = 1000, random_state=1, background_color='black', 
                      colormap='viridis', collocations=False).generate(all_words_str)
plot_cloud(wordcloud)

This is great! I love this channel and when you see this word cloud you will know what this channel is all about, right?

Okay, now I want this dataframe in SQL Server. I am not planning to do much there, but I do want to build a pipeline, and maybe clean and normalize the data a little, too. Well, let's connect to SQL Server.

import pyodbc
from sqlalchemy import create_engine


conn_str = (
    r'DRIVER={ODBC Driver 17 for SQL Server};'
    r'SERVER=YOUR_SERVER_NAME;'
    r'DATABASE=YT_analysis_P1;'
    r'UID=YOUR_USERNAME;'
    r'PWD=YOUR_PASSWORD;'
    r'Trusted_Connection=yes;' # use this instead of UID/PWD if you are on Windows authentication
)

# Establish a connection
conn = pyodbc.connect(conn_str)

Done! We are now connected to our server. Let's load the data into our database.

from urllib.parse import quote_plus

# the ODBC connection string has to be URL-encoded before SQLAlchemy can use it
params = quote_plus(conn_str)
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(params))

import sqlalchemy

# Specify the table name where you want to insert the data
table_name = 'test_1'

column_types = {
    'video_id': sqlalchemy.String,
    'channelTitle': sqlalchemy.String,
    'title': sqlalchemy.String,
    'publishedAt': sqlalchemy.DateTime,
    'viewCount': sqlalchemy.Float,
    'likeCount': sqlalchemy.Float,
    'commentCount': sqlalchemy.Float,
    'duration': sqlalchemy.String,
    'definition': sqlalchemy.String,
    'caption': sqlalchemy.String,
    'publishDayName': sqlalchemy.String,
    'durationSecs': sqlalchemy.Integer,
    'tagCount': sqlalchemy.Integer,
    'title_no_stopwords': sqlalchemy.String
}

# 'tags' and 'title_no_stopwords' hold Python lists, which pyodbc cannot insert directly,
# so join them into plain strings before loading
video_df['tags'] = video_df['tags'].apply(lambda x: ', '.join(x) if x is not None else None)
video_df['title_no_stopwords'] = video_df['title_no_stopwords'].apply(lambda x: ' '.join(x))

# Insert the DataFrame into SQL Server
try:
    video_df.to_sql(table_name, engine, if_exists='append', index=False, dtype=column_types)
    print("Data inserted successfully!")
except Exception as e:
    print(f"Error: {str(e)}")

And that's all! We have now loaded our data into SQL Server. I will be connecting SQL Server to Power BI and doing the data visualization there. Check out how the visualization turned out for this project in part 3!
