On next week’s episode of the ‘Are You Entertained?’ podcast, we’re going to be analyzing the latest generation’s guilty pleasure: the music of the ’00s. Our data scientists have pored over Billboard chart data to analyze what made a hit soar to the top of the charts, and how long it stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry: why do we like what we like?

Project Summary

For this project, I imported and cleaned the data, and computed:

  • The most popular artists in 2000
  • The #1 songs in 2000
  • The time it takes for songs to reach #1, compared to the entire list

Importing Data

First we’ll look at the data to see what features are available and to see which type conversions might be necessary.

year                 int64
artist.inverted     object
track               object
time                object
genre               object
date.entered        object
date.peaked         object
x1st.week            int64
x2nd.week          float64
x3rd.week          float64
x4th.week          float64
x5th.week          float64
x6th.week          float64
x7th.week          float64
x8th.week          float64
dtype: object
   year      artist.inverted                                  track  time  \
0  2000      Destiny's Child               Independent Women Part I  3:38   
1  2000              Santana                           Maria, Maria  4:18   
2  2000        Savage Garden                     I Knew I Loved You  4:07   
3  2000              Madonna                                  Music  3:45   
4  2000  Aguilera, Christina  Come On Over Baby (All I Want Is You)  3:38   

  genre date.entered date.peaked  x1st.week  x2nd.week  x3rd.week     ...      \
0  Rock   2000-09-23  2000-11-18         78       63.0       49.0     ...       
1  Rock   2000-02-12  2000-04-08         15        8.0        6.0     ...       
2  Rock   1999-10-23  2000-01-29         71       48.0       43.0     ...       
3  Rock   2000-08-12  2000-09-16         41       23.0       18.0     ...       
4  Rock   2000-08-05  2000-10-14         57       47.0       45.0     ...       

   x67th.week  x68th.week  x69th.week  x70th.week  x71st.week  x72nd.week  \
0         NaN         NaN         NaN         NaN         NaN         NaN   
1         NaN         NaN         NaN         NaN         NaN         NaN   
2         NaN         NaN         NaN         NaN         NaN         NaN   
3         NaN         NaN         NaN         NaN         NaN         NaN   
4         NaN         NaN         NaN         NaN         NaN         NaN   

   x73rd.week  x74th.week  x75th.week  x76th.week  
0         NaN         NaN         NaN         NaN  
1         NaN         NaN         NaN         NaN  
2         NaN         NaN         NaN         NaN  
3         NaN         NaN         NaN         NaN  
4         NaN         NaN         NaN         NaN  

[5 rows x 83 columns]

Preparing Data

As a matter of preference, let’s rearrange the artist names column so that it reads first_name last_name, and rename the column to something more intuitive (artist_name). We’ll also convert date.entered and date.peaked to datetimes so we can perform calculations on those values. The dates are written year-first (for example, 2000-09-23), so we pass yearfirst = True to pd.to_datetime():

data[['date.entered', 'date.peaked']] = data[['date.entered', 'date.peaked']].apply(pd.to_datetime, yearfirst = True, errors = 'coerce')

print(data[['date.entered', 'date.peaked']].head().dtypes)
date.entered    datetime64[ns]
date.peaked     datetime64[ns]
dtype: object
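
The name rearrangement described above isn’t shown in the output. Here is a minimal sketch of one way to do it, assuming inverted names always take the form "Last, First" with a single comma followed by a space. The helper name uninvert_name is hypothetical and not part of the original notebook:

```python
def uninvert_name(name):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    last, sep, first = name.partition(', ')
    # No comma means a group or single-word act, so keep the name as is
    return f'{first} {last}' if sep else name


print(uninvert_name('Aguilera, Christina'))
print(uninvert_name("Destiny's Child"))

# With pandas, the same helper could populate the renamed column:
# data['artist_name'] = data['artist.inverted'].apply(uninvert_name)
```

Names such as "Backstreet Boys, The" come out as "The Backstreet Boys" under the same rule, which is consistent with the artist names shown later in the analysis.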

Analysis

First, which artists had the most songs on the billboard?

songs_on_the_billboard = data['artist_name'].value_counts()
songs_on_the_billboard[0:10]
Jay-Z                  5
Whitney Houston        4
The Dixie Chicks       4
The Backstreet Boys    3
Sisqo                  3
Toni Braxton           3
Britney Spears         3
LeAnn Rimes            3
Kelly Price            3
SheDaisy               3
Name: artist_name, dtype: int64

It appears that Jay-Z was the most popular overall, but let’s calculate his z-score and p-value to check that his count is unlikely to be the result of random chance:

z = (songs_on_the_billboard['Jay-Z'] - songs_on_the_billboard.mean()) / songs_on_the_billboard.std()
z
5.087836432701593

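Jay-Z’s z-score of about 5.09 puts his song count roughly $5\sigma$ above the mean. Under a normal approximation this corresponds to $p < 0.0001$, so his song count is statistically significant. Song counts per artist are discrete and skewed, so the normal approximation is only a rough guide.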
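
To make the step from z-score to p-value explicit, here is a minimal sketch that computes the one-sided upper-tail probability of a standard normal distribution using only the standard library. The function name upper_tail_p is an assumption for illustration, and the calculation assumes the normal approximation holds:

```python
import math


def upper_tail_p(z):
    """One-sided p-value P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = 5.087836432701593  # z-score reported above
p = upper_tail_p(z)
print(f'p = {p:.2e}')
```

The result is well below the 0.0001 threshold quoted in the text.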

Songs that made it to #1

# Select the weekly ranking columns, then keep the rows whose best (lowest) ranking was #1
held_top_spot = data[data.loc[:, 'x1st.week':'x76th.week'].min(axis = 1) == 1]

print(held_top_spot[['track', 'artist_name']])

held_top_spot['genre'].value_counts()
                                    track         artist_name
0                Independent Women Part I     Destiny's Child
1                            Maria, Maria             Santana
2                      I Knew I Loved You       Savage Garden
3                                   Music             Madonna
4   Come On Over Baby (All I Want Is You)  Christina Aguilera
5                   Doesn't Really Matter               Janet
6                             Say My Name     Destiny's Child
7                             Be With You    Enrique Iglesias
8                              Incomplete               Sisqo
9                                  Amazed            Lonestar
10                       It's Gonna Be Me              N'Sync
11                      What A Girl Wants  Christina Aguilera
12                    Everything You Want    Vertical Horizon
13                    With Arms Wide Open               Creed
14                              Try Again             Aaliyah
15                                   Bent     matchbox twenty
16                  Thank God I Found You        Mariah Carey


Rock       15
Country     1
Latin       1
Name: genre, dtype: int64

However, the genre data need to be manually corrected, since some songs are miscategorized. For example:

               track  artist_name  genre
17           Breathe   Faith Hill    Rap
190  Let's Make Love   Faith Hill    Rap
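
One way to apply such manual corrections is with a lookup of overrides keyed by artist. The sketch below is illustrative only: the GENRE_CORRECTIONS mapping and the corrected_genre helper are hypothetical names, and 'Country' for Faith Hill is an assumption rather than a value taken from the dataset:

```python
# Hypothetical artist-to-genre overrides; 'Country' for Faith Hill is an
# assumption made for illustration, not a value taken from the dataset.
GENRE_CORRECTIONS = {'Faith Hill': 'Country'}


def corrected_genre(artist_name, genre):
    """Return the override for this artist if one exists, else the original genre."""
    return GENRE_CORRECTIONS.get(artist_name, genre)


rows = [
    ('Breathe', 'Faith Hill', 'Rap'),
    ("Let's Make Love", 'Faith Hill', 'Rap'),
    ('Music', 'Madonna', 'Rock'),
]
fixed = [(track, artist, corrected_genre(artist, genre)) for track, artist, genre in rows]
for row in fixed:
    print(row)

# A pandas equivalent might look like:
# data['genre'] = data['artist_name'].map(GENRE_CORRECTIONS).fillna(data['genre'])
```

Keying the overrides by artist keeps the fix short, but it would relabel every song by that artist, so a per-track mapping may be safer where an artist spans genres.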

Time to top

How long does it take for the top songs to reach the #1 spot?

# Time to top refers to the songs that reached #1
time_to_top = (held_top_spot['date.peaked'] - held_top_spot['date.entered']).dt.days
time_to_top.describe()
count     17.000000
mean      94.705882
std       60.678213
min       35.000000
25%       56.000000
50%       84.000000
75%       91.000000
max      273.000000
dtype: float64

Let’s compare this to the entire dataset:

# Time to peak calculates over all songs
time_to_peak_all = (data['date.peaked'] - data['date.entered']).dt.days
time_to_peak_all.describe()
count    317.000000
mean      52.246057
std       40.867601
min        0.000000
25%       21.000000
50%       49.000000
75%       70.000000
max      315.000000
dtype: float64

Do these data have statistically different means? We can try performing Welch’s t-test, but first we need to see if the datasets have normal distributions.

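import matplotlib.pyplot as plt
import seaborn as sns

# Set the palette before plotting so both histograms pick it up
sns.set_palette(sns.color_palette(palette = 'hls', n_colors = 2))

# Density-normalized histograms (histplot replaces the deprecated distplot)
sns.histplot(time_to_top, stat = 'density', bins = 15, label = 'Songs that reached #1')
sns.histplot(time_to_peak_all, stat = 'density', bins = 20, label = 'All songs')

# Plot formatting
plt.xlim(left = 0)
plt.legend()

plt.show()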

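[Figure: normalized histograms of days from chart entry to peak, for songs that reached #1 and for all songs]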
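
The shape of a histogram can also be summarized numerically with the sample skewness, where a positive value indicates a longer right tail. This is a minimal sketch using the Fisher-Pearson coefficient on made-up values; sample_skewness is a hypothetical helper and the numbers are not from the Billboard data:

```python
def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 4, 12]  # made-up values with a long right tail
symmetric = [1, 2, 3, 4, 5]            # made-up symmetric values
print(sample_skewness(right_skewed))
print(sample_skewness(symmetric))

# With pandas, time_to_top.skew() gives a similar measure
# (it uses a bias-adjusted estimator, so values differ slightly).
```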

Neither data set appears to be normally distributed; both are positively skewed. Strictly speaking, a two-sample t-test is therefore not appropriate for deciding whether the mean values differ. Running Welch’s t-test anyway, we find:

peak_vs_top_t_statistic = scipy.stats.ttest_ind(time_to_top, time_to_peak_all, equal_var = False)
print("t-statistic: {}\np-value: {}".format(peak_vs_top_t_statistic[0].round(3), peak_vs_top_t_statistic[1].round(3)))
t-statistic: 2.851
p-value: 0.011

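If the normality assumption held, this result ($p = 0.011$, below the conventional 0.05 level) would let us reject the null hypothesis of equal means and state that songs that reached #1 take significantly longer to peak (about 95 days on average) than songs overall (about 52 days).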
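
To show where the t-statistic comes from, here is a minimal sketch that recomputes Welch’s statistic directly from the summary statistics printed by describe() above, along with the Welch-Satterthwaite degrees of freedom. The function name welch_t is hypothetical, and the sketch stops short of the p-value, which needs the t-distribution from scipy:

```python
import math


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = std1 ** 2 / n1  # squared standard error of the first mean
    v2 = std2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Summary statistics copied from the describe() outputs above
t, df = welch_t(94.705882, 60.678213, 17, 52.246057, 40.867601, 317)
print(f't = {t:.3f}, df = {df:.1f}')
```

The recomputed statistic agrees with the 2.851 reported by scipy.stats.ttest_ind. The degrees of freedom come out close to the size of the smaller sample, which shows that the 17 chart-topping songs dominate the uncertainty in the comparison.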

Songs that stayed the longest time on the billboard

# Count the weeks each song spent on the Billboard (non-missing weekly rankings)
number_of_weeks = data.loc[:, 'x1st.week':'x76th.week'].count(axis = 1)

# Attach the counts and sort by the number of weeks on the Billboard, longest first
longest_stay_dataframe = data.assign(weeks_on_billboard = number_of_weeks).sort_values(by = 'weeks_on_billboard', ascending = False)

# Keep the 10 songs with the longest stays
top_10_longest_stay_dataframe = longest_stay_dataframe[0:10]

top_10_longest_stay_dataframe[['track', 'artist_name', 'weeks_on_billboard']]
                         track       artist_name  weeks_on_billboard
46                      Higher             Creed                  57
9                       Amazed          Lonestar                  55
24                  Kryptonite      3 Doors Down                  53
17                     Breathe        Faith Hill                  53
13         With Arms Wide Open             Creed                  47
28                I Wanna Know               Joe                  44
12         Everything You Want  Vertical Horizon                  41
15                        Bent   matchbox twenty                  39
20        He Wasn't Man Enough      Toni Braxton                  37
47  (Hot S**t) Country Grammar             Nelly                  34

Plotting how the top songs and the longest-staying songs moved

Let’s see how songs move on the billboard. We’ll look at the songs that reached #1 first, and then the songs that stayed on the billboard the longest.

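[Figure: weekly chart positions of the songs that reached #1]

[Figure: weekly chart positions of the songs that stayed on the billboard the longest]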
