<aside> 💡 Project Description: Last.fm, a music streaming platform, has tasked our team with curating a line-up of artists for its upcoming 2-day music festival using its users’ listening data. Through degree distribution analysis, we identified the artists that Last.fm’s users would most enjoy and selected them as our ideal headliners. Additionally, by analyzing the attributes of listening communities and identifying which features of existing artists’ audiences overlap with features of new artists, we recommended new artists who are likely to share a fanbase with popular existing artists. Together, these selections form a line-up of artists that complements our headliners and suits Last.fm’s users.
</aside>
Our team worked on this as the final project for Northwestern’s Social Networks Analytics course. We were tasked with partnering with an “organization” of our choice to improve its decision-making framework. The partnership was imaginary: using publicly available data, we devised a problem the organization might face and presented solutions to it using social network analysis tools. Our team chose Last.fm, a music streaming platform that is looking to host a 2-day music festival, the “Last.fm Music Festival”, in Chicago, IL.
Last.fm is a music streaming platform looking to host a music festival in Chicago, IL. The festival will be a 2-day event, and each invited artist will perform on one of the two days. Our team’s goal is twofold: to help Last.fm construct a line-up of headlining artists that its current users already love, based on their listening data, and to analyze patterns across users’ listening data in order to include new artists that those users are also likely to enjoy.
User listening data: We obtained a dataset from Pompeu Fabra University’s Music Technology Group containing 360,000 randomly selected Last.fm users and a list of each user’s ten most-played artists. The dataset also contains user attributes such as sex, age, location and the date they joined the platform.
🧼 Data Cleaning: We removed users who were under 18, over 70, or outside the US. Music festivals typically have an 18+ age limit, and we believe it would be difficult for users outside the US or over 70 years old to attend given the recent COVID-19 pandemic.
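The filter described above can be sketched in a few lines of Python. This is only an illustration: the record layout (`user_id`, `age`, `country`), the country label `"United States"`, and the choice to drop users with no recorded age are assumptions, not details taken from the dataset.

```python
# Hypothetical record layout; the real dataset's field names may differ.
def is_eligible(user, min_age=18, max_age=70, country="United States"):
    """Return True if the user passes the age and location filter."""
    age = user.get("age")
    if age is None:
        # Assumption: users with no recorded age are dropped.
        return False
    return min_age <= age <= max_age and user.get("country") == country


# Toy records for illustration only.
users = [
    {"user_id": "u1", "age": 17, "country": "United States"},
    {"user_id": "u2", "age": 25, "country": "United States"},
    {"user_id": "u3", "age": 30, "country": "Spain"},
    {"user_id": "u4", "age": 71, "country": "United States"},
    {"user_id": "u5", "age": None, "country": "United States"},
]

eligible = [u for u in users if is_eligible(u)]
```

Here only `u2` survives the filter; users aged exactly 18 or 70 are kept, since the rule removes those under 18 or over 70.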
We extracted the 150 artists that the largest number of users listed among their top-ten artists. We chose 150 because a similar Chicago-based music festival, Lollapalooza, has approximately 150 artists in its line-up. From this list, we then removed artists who were either inactive or deceased, leaving us with 116 artists in total.
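As a rough sketch of this selection step, the ranking can be done by counting how many users list each artist and keeping the top `n`. The function name `top_artists`, the `exclude` parameter (standing in for the manually identified inactive or deceased artists), and the toy data are all hypothetical.

```python
from collections import Counter


def top_artists(user_top_artists, n=150, exclude=frozenset()):
    """Rank artists by how many users list them, keep the top n,
    then drop any artist named in `exclude`."""
    counts = Counter()
    for artists in user_top_artists.values():
        # set() guards against an artist appearing twice in one user's list.
        counts.update(set(artists))
    ranked = counts.most_common(n)
    return [(artist, count) for artist, count in ranked if artist not in exclude]


# Toy data for illustration only.
user_top_artists = {
    "u1": ["Radiohead", "Coldplay"],
    "u2": ["Radiohead", "Kanye West"],
    "u3": ["Radiohead", "Coldplay"],
}

top_two = top_artists(user_top_artists, n=2)
```

Note that the exclusion is applied after the top-`n` cut, mirroring the order described above, which is why 150 candidates shrink to 116 rather than being refilled.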
Artist attributes data: Using Selenium in Python, we scraped Wikipedia to generate a dataset of artist attributes. This dataset covers the 116 artists and records each artist’s genre, sex, the year they started making music, and whether they are a band or a solo artist.
Using the statnet and igraph packages in R, we created a network of artist relations. Each node in this network represents an artist. We define an edge to exist between two nodes if there is a significant overlap in the audiences of the two associated artists. We quantified audience overlap using the Jaccard similarity coefficient: if two artists have a Jaccard similarity coefficient of over 0.1, we conclude that a tie exists between them. The table below shows the summary statistics of the artist network:
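For two audiences A and B, the Jaccard similarity coefficient is |A ∩ B| / |A ∪ B|, the share of their combined listeners that both artists have in common. The project built the network in R; the Python sketch below only illustrates the tie rule, and the names `jaccard`, `build_edges`, and the toy `listeners` mapping are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # Convention chosen here for two empty audiences.
    return len(a & b) / len(union)


def build_edges(listeners, threshold=0.1):
    """Return (artist_x, artist_y, score) for every pair of artists whose
    audience overlap is strictly above the threshold."""
    edges = []
    for x, y in combinations(sorted(listeners), 2):
        score = jaccard(listeners[x], listeners[y])
        if score > threshold:
            edges.append((x, y, score))
    return edges


# Toy mapping of artist -> set of user ids, for illustration only.
listeners = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {9},
}

edges = build_edges(listeners)
```

In this toy example, artists A and B share two of four combined listeners (a score of 0.5), so they are tied, while C shares no listeners with either and stays isolated.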
The figure below illustrates the artist network. The size of each node represents the artist’s popularity, which we computed as the number of users who named the artist as one of their top-ten artists. The most popular artists included Radiohead, Kanye West, Coldplay and Death Cab for Cutie. The color of each node represents the artist’s main genre. There are 5 genres: pop, rock, punk, indie and hip hop.
Takeaway: From the figure, we can see a relatively clear divide between genres. Because ties reflect shared audiences, this suggests that users who listen to one artist in a genre typically listen to other artists in that same genre.