Project | Davide Romano

Spotify Podcasts

Religion podcasts network showing the betweenness centrality of nodes; colors indicate clusters

Key Takeaways

Exploration of an extensive dataset of podcast transcripts through text network analysis, which yielded new hypotheses for further research.

Experimentation with a method that could be used for topic modeling, community detection, document clustering and visualization.

How to interpret the data carefully by taking into account algorithm design (e.g., parameter choice) as well as social, linguistic and thematic aspects.

This project was done for the course DH-500 Computational Social Media, taught by Professor Daniel Gatica-Perez.

Motivation

Our dataset contains transcripts of a large set of podcast episodes spanning a wide range of genres and categories.

We wanted to investigate how different categories differ linguistically, how podcasts are interrelated in terms of their vocabulary usage, and how diverse that vocabulary is within each category.

To answer these questions, we used a method called Text Network Analysis.

Data

Our dataset contains more than 100k podcasts that were sampled randomly from different categories. We manually mapped 100+ categories into 21 broader ones.

Methods

Our method is based on the Textnets library, which performs text network analysis on a corpus of text. A network-based approach to automated text analysis has several benefits.

Patterns of connections between words help to pin down their meaning more precisely than "bag of words" techniques, much as clusters of social relationships, such as friendships or affiliations, can help explain a variety of social outcomes.
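As a rough illustration of the idea (not Textnets' actual implementation), the sketch below builds a document-term network from a toy corpus, weights each document-term link by TF-IDF, and projects it onto documents so that two documents are linked when they share weighted vocabulary. The corpus, the function names and the exact weighting formula are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): document name -> list of tokens.
docs = {
    "sermon_a": ["faith", "prayer", "church", "faith"],
    "sermon_b": ["faith", "church", "gospel"],
    "startup_a": ["market", "founder", "revenue"],
    "startup_b": ["market", "revenue", "growth", "faith"],
}

def tfidf(docs):
    """Return {doc: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

def project_documents(weights):
    """Link two documents by summing, over shared terms, the product of their weights."""
    edges = {}
    for a, b in combinations(sorted(weights), 2):
        shared = set(weights[a]) & set(weights[b])
        w = sum(weights[a][t] * weights[b][t] for t in shared)
        if w > 0:
            edges[(a, b)] = w
    return edges

bipartite = tfidf(docs)                 # document-term links
doc_edges = project_documents(bipartite)  # document-document links

for pair, weight in sorted(doc_edges.items(), key=lambda item: -item[1]):
    print(pair, round(weight, 4))
```

In this toy corpus the two sermons and the two startup episodes end up strongly linked to each other, while the only cross-topic links come from the single shared word "faith" and are much weaker.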

We also identified clusters of documents with the Louvain community detection algorithm, which assigns each podcast to exactly one cluster.

Furthermore, we extracted the 10 words with the highest TF-IDF scores within each cluster. We also used measures such as closeness centrality and betweenness centrality to assess node importance, network connectivity, and community structure.
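Louvain works by greedily searching for a partition with high modularity. The sketch below does not implement Louvain itself; it only computes the modularity score for a given partition, to show what the algorithm is trying to maximize. The toy graph (two triangles joined by one edge) and all names are assumptions made for this example.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman modularity Q of a weighted, undirected graph for a given partition.

    edges: {(u, v): weight}, each undirected edge listed once, no self-loops.
    membership: {node: cluster label}.
    """
    total = sum(edges.values())        # m: total edge weight
    degree = defaultdict(float)        # weighted degree of each node
    internal = defaultdict(float)      # edge weight inside each cluster
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    cluster_degree = defaultdict(float)  # summed degree of each cluster
    for node, d in degree.items():
        cluster_degree[membership[node]] += d
    # Q = sum over clusters of (internal weight / m) - (cluster degree / 2m)^2
    return sum(
        internal[c] / total - (cluster_degree[c] / (2 * total)) ** 2
        for c in cluster_degree
    )

# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(edges, split)    # one cluster per triangle
q_single = modularity(edges, single)  # everything in one cluster
print(q_split, q_single)
```

Putting each triangle in its own cluster scores higher than lumping all nodes together, which is the kind of partition a modularity-based method such as Louvain would prefer.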
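To make betweenness centrality concrete, here is a minimal sketch of Brandes' algorithm on a small unweighted, undirected graph. It is a simplification for illustration: the project's networks are weighted, and the toy graph and function name are assumptions, not the code we ran.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalized betweenness centrality (Brandes) for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:          # first time we reach w
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair is counted from both endpoints, so halve the totals.
    return {v: s / 2 for v, s in score.items()}

# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
adjacency = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
centrality = betweenness(adjacency)
print(centrality)
```

The two bridge nodes c and d sit on every shortest path between the triangles and get the highest scores, while the remaining nodes score zero. Nodes that connect otherwise separate clusters stand out in the same way in a podcast network.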

Results

We treated the presence or absence of clusters in a network, and the connectivity between clusters, as proxies for linguistic and stylistic diversity within the podcasts. An important question remained: are the clusters due to a thematic or a linguistic difference?

To address this, we qualitatively examined the top 10 TF-IDF words and the podcast names for each cluster. From this qualitative analysis, we were able to formulate a potential answer to this question. The resulting networks differed vastly from one category to another.
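One way to obtain such per-cluster word lists is sketched below: pool the tokens of all documents in a cluster, score each term by TF-IDF across clusters, and keep the top k. Pooling by cluster, the tie-breaking rule and the toy data are assumptions made for this example and may differ from what Textnets does internally.

```python
import math
from collections import Counter

def top_terms_per_cluster(docs, membership, k=10):
    """Pool each cluster's tokens, score terms by TF-IDF across clusters, keep the top k."""
    pooled = {}
    for name, tokens in docs.items():
        pooled.setdefault(membership[name], Counter()).update(tokens)
    n_clusters = len(pooled)
    # Cluster frequency: in how many clusters each term appears.
    df = Counter(term for counts in pooled.values() for term in counts)
    top = {}
    for cluster, counts in pooled.items():
        size = sum(counts.values())
        scores = {
            term: (count / size) * math.log(n_clusters / df[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top[cluster] = ranked[:k]
    return top

# Toy episodes and cluster assignments (hypothetical).
docs = {
    "ep1": ["torah", "rabbi", "god", "torah"],
    "ep2": ["rabbi", "god", "shabbat"],
    "ep3": ["jesus", "church", "god"],
    "ep4": ["church", "gospel", "jesus", "god"],
}
membership = {"ep1": 0, "ep2": 0, "ep3": 1, "ep4": 1}

top = top_terms_per_cluster(docs, membership, k=3)
print(top)
```

The word "god" appears in both clusters, so its inverse document frequency is zero and it drops out of both lists. This is the reason TF-IDF word lists help tell clusters apart: they highlight what is distinctive rather than what is merely frequent.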

They fell into two main types: "Separated clusters" networks, whose clusters are weakly connected and clearly separated, and "Mixed clusters" networks, whose clusters are highly interconnected and hard to separate.

The Religion & Spirituality category is an example of a "Separated clusters" network, while the Business category is an example of a "Mixed clusters" network.
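Our distinction between the two types was qualitative. As a hedged illustration of how it could be made quantitative, the sketch below computes the share of total edge weight that crosses cluster boundaries: a low share suggests separated clusters, a high share suggests mixed ones. This measure, the toy graphs and the names are assumptions for the example, not part of the original analysis.

```python
def cross_cluster_share(edges, membership):
    """Fraction of total edge weight on edges whose endpoints are in different clusters."""
    total = sum(edges.values())
    crossing = sum(
        w for (u, v), w in edges.items() if membership[u] != membership[v]
    )
    return crossing / total if total else 0.0

membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

# Two dense triangles joined by a single weak edge.
separated = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 0.1,
}
# Same nodes, but most of the weight runs between the two clusters.
mixed = {
    ("a", "b"): 1.0, ("d", "e"): 1.0,
    ("a", "d"): 1.0, ("b", "e"): 1.0, ("c", "f"): 1.0, ("c", "d"): 1.0,
}

share_separated = cross_cluster_share(separated, membership)
share_mixed = cross_cluster_share(mixed, membership)
print(round(share_separated, 3), round(share_mixed, 3))
```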

Looking at the top words for Religion & Spirituality, we can qualitatively determine that the clusters reflect a thematic, and consequently linguistic, difference: Spirituality, Judaism and Christianity.

In the case of Business, the subtopics revealed by the word clouds are well defined and coherent; however, the network is highly interconnected.

This leads us to two possible hypotheses. The first is that all business podcasts talk about a wide range of "subtopics". The second is that the business domain relies on a common vocabulary that is used across different situations and subtopics.

Is "business language" therefore always the same?

These are just two examples of possible interpretations of the results. There are many other networks that could be explored further, such as the one we created by combining different categories in a single network.

Network with multiple categories

Insights & Challenges

Data mining as inspiration: Our results could lead us everywhere and nowhere. What we got was inspiration, and that's what I think data mining can do. This was the first step towards a deeper analysis of this huge dataset. We started with some hypotheses but we ended with more.

Algorithm/parameter design matters: Correct interpretation of the results is strongly affected by the small design decisions made throughout the process. Our approach to sampling episodes, selecting those most frequent within each category, directly influenced our findings. Similarly, minor parameter adjustments determined the visibility of certain edges over others, underscoring the impact of these choices on our results.
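To illustrate how sensitive the picture can be, the sketch below applies a minimum edge weight to a toy network and counts the resulting connected components. The threshold parameter, its values and the graph are hypothetical and only stand in for the kind of small adjustment we mean.

```python
def visible_edges(edges, min_weight):
    """Keep only edges whose weight reaches the threshold."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}

def count_components(nodes, edges):
    """Count connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

nodes = ["a", "b", "c", "d", "e"]
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.2, ("d", "e"): 0.7}

components = {
    threshold: count_components(nodes, visible_edges(edges, threshold))
    for threshold in (0.1, 0.5, 0.85)
}
print(components)
```

The same data reads as one connected network, two clusters, or mostly isolated nodes depending on a single number, which is why we treat such choices as part of the interpretation.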

Team: Davide Romano, Cindy Tang, Mariella Daghfal

Author: Davide Romano