The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine.
Factors that are used to calculate the Billboard Top Tracks
I scraped this data from Wikipedia's list of Billboard #1 Tracks from 2010 - 2019 https://en.wikipedia.org/wiki/Billboard_Hot_100
Spotify has been a leader in enabling discovery of new music. The company uses audio analysis models to extract features about the song - how danceable it is, how energetic it is, among other things. They use these features to robustly predict what songs a person is more likely to love.
Lucky for us, Spotify gives access to their API here https://developer.spotify.com/
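Spotipy is a sweet Python package that makes it easy to connect to the Spotify Web API.
We'll primarily be using two Spotify API endpoints: search (to look up a track and its URI) and audio features (to fetch the per-track metrics).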
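import os
import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
# Read the credentials from the environment rather than hardcoding them in the notebook.
# Assumes SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET have been set beforehand.
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)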
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from fbprophet import Prophet
from sklearn.metrics import mean_squared_error
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
billboard = pd.read_csv('billboardtop.csv', parse_dates=["date"])
billboard.head()
#Cleaning up
billboard['song'] = billboard['song'].apply(lambda x: x.split('"')[1])
billboard['aSearch'] = billboard['artist'].apply(lambda x : " ".join(x.split(" ")[:2]))
billboard['year'] = billboard['date'].dt.year
billboard['searcher'] = billboard['song'] + " " + billboard['aSearch']
billboard.head()
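To make the cleanup step concrete, here is a minimal sketch of how one row turns into a search string. The function name `build_search_query` and the `max_artist_words` parameter are illustrative and not part of the notebook; the sample row is hand-typed, assuming the scraped title is wrapped in double quotes as the `split('"')` call implies.

```python
def build_search_query(song_cell, artist_cell, max_artist_words=2):
    """Combine a quoted song title and a truncated artist credit into one query."""
    # The title is the text between the first pair of double quotes.
    title = song_cell.split('"')[1]
    # Keep only the first few words of the artist credit so that long
    # "featuring ..." tails do not make the search too specific.
    artist = " ".join(artist_cell.split(" ")[:max_artist_words])
    return title + " " + artist


query = build_search_query('"See You Again"', "Wiz Khalifa featuring Charlie Puth")
print(query)  # See You Again Wiz Khalifa
```

One caveat of the two-word cut: a one-word artist followed by a feature credit (say, "Rihanna featuring ...") keeps the word "featuring" in the query.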
#Example
sp.search('See You Again')
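def getTrackURI(searcher):
    # Return the URI of the first search hit, or None when nothing matches.
    result = sp.search(searcher)
    try:
        obj1 = result['tracks']['items'][0]['uri']
    except (KeyError, IndexError, TypeError):
        obj1 = None
    return obj1
def getArtists(searcher):
    # Return the first hit's artist names as a comma-separated string ("" when nothing matches).
    result = sp.search(searcher)
    try:
        obj2 = result['tracks']['items'][0]['artists']
        names = []
        for name in obj2:
            names.append(name['name'])
        obj2 = ",".join(names)
    except (KeyError, IndexError, TypeError):
        obj2 = ""
    return obj2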
billboard['uri'] = billboard['searcher'].apply(getTrackURI)
billboard['artists'] = billboard['searcher'].apply(getArtists)
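# Reset the index so the rows still line up with audio_features in the later pd.concat(axis=1).
billboard = billboard.dropna().reset_index(drop=True)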
billboard.head()
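The two lookups above each run their own search for the same string. As a rough sketch of an alternative, the parsing can be done once on a single response. The helper `first_track_info` is hypothetical, and the response below is a hand-written stand-in that only mimics the nested keys the notebook reads (`tracks`, `items`, `uri`, `artists`, `name`); the URI is a placeholder, not a real track.

```python
def first_track_info(result):
    """Pull the URI and artist names of the first hit out of a search response."""
    try:
        track = result['tracks']['items'][0]
    except (KeyError, IndexError, TypeError):
        # No hit or an unexpected response shape: fall back to None / "".
        return None, ""
    names = [artist['name'] for artist in track['artists']]
    return track['uri'], ",".join(names)


# Hand-written stand-in for a search response (placeholder URI).
fake_result = {
    'tracks': {
        'items': [
            {'uri': 'spotify:track:EXAMPLE',
             'artists': [{'name': 'Wiz Khalifa'}, {'name': 'Charlie Puth'}]}
        ]
    }
}

print(first_track_info(fake_result))
print(first_track_info({'tracks': {'items': []}}))
```

With a helper like this, one search per row would be enough to fill both the `uri` and `artists` columns.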
danceability (float): Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy (float): Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
liveness (float): Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness (float): The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode (int): Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness (float): Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
tempo (float): The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
valence (float): A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
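The thresholds quoted in these descriptions can be turned into simple labels. The sketch below is only an illustration of the cut-offs stated above (0.33 and 0.66 for speechiness, 0.8 for liveness); the function names and the label strings are made up for this example.

```python
def speechiness_band(speechiness):
    """Map a speechiness score to the three bands described above."""
    if speechiness > 0.66:
        return "mostly speech"       # probably made entirely of spoken words
    if speechiness >= 0.33:
        return "music and speech"    # may mix music and speech, e.g. rap
    return "mostly music"            # most likely non-speech-like


def likely_live(liveness):
    """True when liveness is above the 0.8 mark that suggests a live recording."""
    return liveness > 0.8


print(speechiness_band(0.45), likely_live(0.12))
```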
audio_features = []
for uri in billboard['uri'].values:
    audio_features.append(sp.audio_features(uri)[0])
audio_features = pd.DataFrame(audio_features)[['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                                               'acousticness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']]
audio_features['duration'] = audio_features['duration_ms']/60000
audio_features.head()
Season
billboard['year'] = billboard['date'].dt.year
billboard['month'] = billboard['date'].dt.month
def season(month):
    if month in [4, 5, 6]:
        return 'Spring'
    elif month in [7, 8, 9]:
        return 'Summer'
    elif month in [10, 11, 12]:
        return 'Fall'
    else:
        return 'Winter'
billboard['season'] = billboard['month'].apply(season)
billDf = pd.concat([billboard, audio_features], axis=1)[['song', 'id', 'artists','date','year','season', 'artist', 'weeks','danceability', 'energy',
'key', 'loudness', 'mode', 'speechiness',
'acousticness', 'liveness', 'valence', 'tempo', 'duration', 'time_signature', 'uri' ]]
billDf.head()
billDf['happy'] = billDf['valence'] >= 0.5
billDf['acoustic'] = billDf['acousticness'] >= 0.5
billDf['rap'] = (billDf['speechiness'] >=0.33) & (billDf['speechiness'] <=0.66)
def tidy_split(df, column, sep='|', keep=False):
    # Split the values of `column` on `sep` and return one row per split value.
    # With keep=True, the unsplit row is kept as well when it held several values.
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
artists = tidy_split(billDf, 'artists', ',').dropna()
artists.head(6)
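For readers unfamiliar with this reshaping, the idea is one output row per artist on a track. Below is a small plain-Python sketch of the same idea on a list of dictionaries; `split_rows` and the two sample rows are illustrative only and are not taken from the dataset.

```python
def split_rows(rows, column, sep=","):
    """Return one output row per separated value found in `column`."""
    out = []
    for row in rows:
        for value in row[column].split(sep):
            new_row = dict(row)       # copy so the input rows stay untouched
            new_row[column] = value
            out.append(new_row)
    return out


# Hand-typed sample rows for illustration.
tracks = [
    {"song": "See You Again", "artists": "Wiz Khalifa,Charlie Puth"},
    {"song": "Rude Boy", "artists": "Rihanna"},
]
for row in split_rows(tracks, "artists"):
    print(row)
```

Recent pandas versions also provide `Series.str.split` together with `DataFrame.explode`, which may be a shorter route to the same shape.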
yearsAcross = billDf.groupby('year')['id'].count()
f, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=yearsAcross.index, y=yearsAcross.values, palette="rocket")
plt.title('No. of Billboard #1 Tracks Yearly')
plt.xlabel("Year")
plt.ylabel("No. of Tracks")
len(billDf)
We can see that most of the Billboard #1 tracks were solo songs. A significant number of duets made it to the top of the chart, but beyond two artists the counts drop off sharply.
f, ax = plt.subplots(figsize=(10, 6))
noPeople = artists.groupby('id').count()['artists'].value_counts()
sns.barplot(x=noPeople.index, y=noPeople.values, palette="rocket")
plt.title('Artist Collaboration on a #1 Billboard Track?')
plt.xlabel("No. of Artists on the track")
plt.ylabel("Frequency")
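The counting behind this chart is a two-step tally: first the number of artists on each track, then how many tracks have each artist count. A minimal sketch with `collections.Counter`, using made-up track ids (`t1`, `t2`, `t3`) and a hypothetical helper name:

```python
from collections import Counter


def collaboration_sizes(pairs):
    """Tally how many tracks have 1 artist, 2 artists, and so on."""
    # Step 1: number of artist rows per track id.
    artists_per_track = Counter(track_id for track_id, _ in pairs)
    # Step 2: how often each artist count occurs across tracks.
    return Counter(artists_per_track.values())


# Illustrative (track id, artist) pairs; the ids are invented for this sketch.
pairs = [
    ("t1", "Katy Perry"),
    ("t2", "Wiz Khalifa"),
    ("t2", "Charlie Puth"),
    ("t3", "Adele"),
]
print(dict(collaboration_sizes(pairs)))  # {1: 2, 2: 1}
```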
We can see here that Katy Perry had a whopping 7 tracks at Billboard #1, followed by a host of other artists (Maroon 5, Justin Bieber, Bruno Mars, Adele, etc.) with 4 tracks each.
f, ax = plt.subplots(figsize=(10, 6))
artistsCounts = artists.groupby('artists')['id'].count().sort_values(ascending=False)[:10]
sns.barplot(x=artistsCounts.index, y=artistsCounts.values)
plt.xticks(rotation=75)
plt.ylabel('No. of Songs in Billboard #1')
plt.xlabel('Artist')
plt.title('Songs in Billboard #1 : 2010-2019')
We can see here that Katy Perry had 7 #1 tracks in a short span between 2010 and 2013, but hasn't landed a #1 in the rest of the decade. You could make the claim that, from the perspective of #1 songs, Katy Perry hasn't stayed relevant.
f, ax = plt.subplots(figsize=(10, 6))
katyYears = artists[artists['artists']=="Katy Perry"]['year'].value_counts()
sns.barplot(x=katyYears.index, y=katyYears.values, palette="rocket")
plt.title('Katy Perry #1 Tracks')
plt.ylabel('Count')
Lady Gaga tops the chart here, with Billboard #1 songs spanning 8 years. But Maroon 5 and Bruno Mars have significantly more top songs (4 each) spread over 7 years. We could claim that Bruno Mars and Maroon 5 have been super reliable at delivering chart-topping songs.
We can see here that Katy Perry has the highest number of top tracks, but they span only three years on the Billboard charts.
Newer artists like Post Malone and Camila Cabello have naturally spent fewer years at Billboard #1. But in that short span they've secured 3 top tracks, which suggests they're on the rise.
adf = pd.DataFrame()
timeline = artists.groupby('artists')['year'].max() - artists.groupby('artists')['year'].min()
counts = artists.groupby('artists')['year'].count()
scores = timeline*counts/10
adf['timeline'] = timeline
adf['counts'] = counts
adf['score'] = scores
adf['artists'] = timeline.index
adf = adf.sort_values(by="timeline", ascending=False)
f, ax = plt.subplots(figsize=(15, 6))
sns.barplot(x=adf['artists'][:15], y=adf['timeline'][:15])
plt.plot(adf['artists'][:15], adf['counts'][:15], color="orange", label="No. of Billboard #1 Tracks")
plt.title('How Relevant were the Top Artists?')
plt.ylabel('Number of Years')
plt.xticks(rotation=75)
plt.legend()
for i, txt in enumerate(adf['counts'][:15]):
    ax.annotate(txt, (adf['counts'][:15].index[i], adf['counts'][:15].values[i]))
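The `score` column combines the two quantities on the chart: the span in years between an artist's first and last #1, and the number of #1 tracks, scaled by the ten years in the decade. A small worked example follows, using the figures quoted in the text (7 tracks over 3 years, and 4 tracks over 7 years); the function name and its `decade_length` parameter are illustrative.

```python
def relevance_score(timeline_years, n_tracks, decade_length=10):
    """Span between first and last #1 (in years) times the track count, scaled by the decade."""
    return timeline_years * n_tracks / decade_length


# Figures as quoted in the text: Katy Perry (3 years, 7 tracks),
# Maroon 5 / Bruno Mars (7 years, 4 tracks).
print(relevance_score(3, 7))  # 2.1
print(relevance_score(7, 4))  # 2.8
```

Under this score, four tracks spread over seven years rank above seven tracks packed into three, which matches the reading that longevity matters alongside raw counts.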
Valence is a measure of how happy the song is on a scale of 0 to 1.
Presidential election cycles always yield interesting results when analyzing time series. Here we can see a sharp drop in valence around the 2016 Presidential Election, when #1 songs were sadder than at any other point in the decade. I'll leave the interpretation of the results to you.
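# numeric_only=True skips text columns (song, artist, ...), which newer pandas versions refuse to average.
billyear = billDf.groupby('year').mean(numeric_only=True).reset_index()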
billyear['year'] = billyear['year'].apply(int)
f, ax = plt.subplots(figsize=(10, 6))
sns.pointplot(x="year", y="valence", data=billyear)
plt.axvline(2, linestyle='--', color='blue')
plt.axvline(6, linestyle='--', color='red')
plt.text(1,0.51, '2012 Presidential Election', color='black', fontsize=8)
plt.text(5,0.51, '2016 Presidential Election', color='black', fontsize=8)
plt.xticks(rotation=70)
plt.title('How Happy were #1 songs during Presidential Election years?')
plt.ylabel('Valence -- Happiness Index')
There's been a strong movement back towards quieter music, with a drop of about 1.5 dB over the decade. Previous decades saw artists competing with each other on loudness so as to pop more on radio channels. Songs that are mastered to be loud often have poor dynamics and are not great for enjoying on good audio gear.
In addition, songs are now mastered to be played on YouTube, Apple Music and Spotify - all of which impose loudness thresholds to maximize audio quality. Thanks to this disincentive, and consumers' increasing preference for audio quality, tracks are becoming less loud.
f, ax = plt.subplots(figsize=(10, 6))
sns.regplot(x="year", y="loudness", data=billyear)
plt.xticks(rotation=70)
plt.title('Loudness across the Years')
plt.ylabel('Loudness dB')
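A figure like "a drop over the decade" can be read off the fitted line: the slope of a least-squares fit in dB per year, multiplied by the number of years spanned. The sketch below computes an ordinary least-squares line in plain Python. The loudness values are made up (constructed to fall on a straight line) purely to show the arithmetic; they are not the notebook's data, and `linear_trend` is a hypothetical helper.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


years = list(range(2010, 2020))
# Made-up yearly mean loudness (dB), falling by 0.15 dB per year.
loudness = [-5.0 - 0.15 * (year - 2010) for year in years]

slope, intercept = linear_trend(years, loudness)
change_over_decade = slope * (years[-1] - years[0])
print(round(slope, 3), round(change_over_decade, 2))
```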
Spotify defines Energy as how bright and fast a song is - songs that have a lot of high frequencies (think of blaring synths, or simply singers shrieking, yikes!) rank pretty high on the energy spectrum.
We can see that Energy has been steadily dropping over the decade, by over 25%, and has become less of a priority for music producers. In parallel, we can see that Acoustic tracks have seen a 100% lift across the decade. People are definitely liking gentler music now.
f, ax = plt.subplots(figsize=(10, 6))
sns.regplot(x="year", y="energy", data=billyear, label="Energy")
sns.regplot(x="year", y="acousticness", data=billyear, label="Acousticness")
plt.legend()
plt.xticks(rotation=70)
plt.title('Gentler music is winning')
plt.ylabel('Metric')
Danceability has seen a significant upward trend across the decade, despite Energy being on the decline. Do people like to dance to tracks that are less energetic? How do we reconcile the two?
Let's look at tracks on both ends of the spectrum -
2010-11: Teenage Dream by Katy Perry and Rude Boy by Rihanna ruled the charts. These are high-energy tracks, with bright frequencies dominating.
2018-19: Old Town Road by Lil Nas X, Sucker by Jonas Brothers, Without Me by Halsey are all very warm and dark tracks (if one could visualize music) - but they are utterly groovy.
Energetic tracks used to be associated with being more danceable, but this decade has been leaning towards groovy and funky over energetic.
f, ax = plt.subplots(figsize=(12, 7))
sns.regplot(x="year", y="energy", data=billyear, label="Energy")
sns.regplot(x="year", y="danceability", data=billyear, label="Danceability")
plt.legend()
plt.xticks(rotation=70)
plt.title('Energy vs Danceability')
plt.ylabel('Metric')
plt.text(2010,0.6, 'Rude Boy - Rihanna', color='black', fontsize=8)
plt.text(2008,0.7, 'Teenage Dream - Katy Perry', color='black', fontsize=8)
plt.text(2009,0.65, 'Not Afraid - Eminem', color='black', fontsize=8)
plt.text(2019,0.77, 'Sucker - Jonas Brothers', color='black', fontsize=8)
plt.text(2018,0.7, 'Old Town Road - Lil Nas X', color='black', fontsize=8)
plt.text(2018,0.75, 'Without Me - Halsey', color='black', fontsize=8)
billDf[(billDf.danceability>0.7) & (billDf.energy < 0.6)].sort_values(by="year").dropna()[-5:][['song', 'artist']]
billDf[(billDf.danceability>0.7) & (billDf.energy > 0.6)].sort_values(by="year").dropna()[:5][['song', 'artist']]
seasonsDf = billDf.groupby('season').agg({'valence':np.median, 'danceability':np.median, 'happy':np.mean, 'duration':np.mean})
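# groupby sorts the seasons alphabetically (Fall, Spring, Summer, Winter); reorder them Winter -> Fall.
seasonsDf['order'] = [4,2,3,1]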
seasonsDf.sort_values(by="order", inplace=True)
seasonsDf
seasonsDf['valence'] = (seasonsDf['valence'] - np.mean(seasonsDf['valence']))*100/np.mean(seasonsDf['valence'])
seasonsDf['danceability'] = (seasonsDf['danceability'] - np.mean(seasonsDf['danceability']))*100/np.mean(seasonsDf['danceability'])
seasonsDf['happy'] = (seasonsDf['happy'] - np.mean(seasonsDf['happy']))*100/np.mean(seasonsDf['happy'])
seasonsDf['duration'] = (seasonsDf['duration'] - np.mean(seasonsDf['duration']))*100/np.mean(seasonsDf['duration'])
seasonsDf
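Each of the four normalisation lines applies the same formula: the percentage by which a season's value sits above or below the mean across seasons. A compact sketch of that formula in plain Python is shown below. The seasonal values are invented for illustration and `pct_from_mean` is a hypothetical helper, not part of the notebook.

```python
def pct_from_mean(values):
    """Express each value as a percentage deviation from the mean of all values."""
    mean = sum(values.values()) / len(values)
    return {key: (value - mean) * 100 / mean for key, value in values.items()}


# Made-up seasonal medians, chosen so the mean is 0.5.
example = {"Winter": 0.60, "Spring": 0.50, "Summer": 0.50, "Fall": 0.40}
for season_name, pct in pct_from_mean(example).items():
    print(season_name, round(pct, 1))
```

Because the deviations are taken from the mean of the four seasons, they sum to roughly zero, so a positive bar for one season always comes with a negative bar elsewhere.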
The data seems to point to the idea that people like listening to happier songs during Winter. This seems counterintuitive, but from a psychological standpoint people are more likely to be sad during Winter. Music can often function as an antidote - a happy song on a bad day can really elevate the mood. I have been looping Feels by Calvin Harris to pump myself up and avoid the perils of the Minnesotan Winter.
f, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=seasonsDf.index, y=seasonsDf.happy, palette="vlag")
plt.title("Effect of Seasons on Song Happiness")
plt.ylabel('% Increase in Happiness Index')
plt.axhline(0, linestyle='-', color='grey')
plt.text(2.3,0.5, 'Baseline: Mean Happiness', color='black', fontsize=12)
Summer's #1 songs have the highest median danceability - and this makes sense.
f, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=seasonsDf.index, y=seasonsDf.danceability, palette="rocket")
plt.title("Summer has more danceable songs")
plt.ylabel('% Increase in Danceability of Billboard Songs')
plt.axhline(0, linestyle='-', color='grey')
plt.text(2.55,0.1, 'Baseline: Mean Danceability', color='black', fontsize=9)