PodcastRE Analytics: Using RSS to Study the Cultures and Norms of Podcasting
Eric Hoyt, University of Wisconsin-Madison, ehoyt@wisc.edu
Kahl Family Professor of Media Production in the Department of Communication Arts at
the University of Wisconsin-Madison. He is the Director of the Media History Digital Library and the
Wisconsin Center for Film and Theater
Research. His ongoing work on digital projects focuses on global film magazine
digitization, podcast preservation, and building networks for educational radio
history.
J.J. Bersch, University of Wisconsin-Madison, jbersch@wisc.edu
JJ Bersch is a PhD Candidate in Film at the University of Wisconsin-Madison. He also
works in data curation and post-production at the Media History Digital Library. His
research is primarily interested in the industrial, cultural, and aesthetic
implications of crossover between the entertainment and advertising industries. He is
currently writing a dissertation entitled “Pack Your Product’s Bags, It's Going
Hollywood: Explaining the Mainstream Emergence of Cinematic Product Placement in the
1980s.”
Susan Noh, University of Wisconsin-Madison, snoh8@wisc.edu
Susan Noh is a teaching assistant in Media and Cultural Studies in the Department of
Communication Arts at the University of Wisconsin-Madison.
Samuel Hansen, University of Michigan and University of Wisconsin-Madison, hansensm@umich.edu
Mathematics and Statistics librarian for the University of Michigan and a database
developer for the University of Wisconsin, Madison working on the Media History
Digital Library and PodcastRE. They also produce and host podcasts for ACMEScience,
including the podcast Relatively Prime which tells stories from the world of
mathematics.
Jacob Mertens, University of Wisconsin-Madison, jmertens2@wisc.edu
PhD candidate in the Media and Cultural Studies program at the University of
Wisconsin-Madison. He studies digital distribution, video games, and translation and
localization practices. His dissertation explores the industrial trends and
affordances of the video game digital marketplace, tensions between video game
producers and their audiences, and the implications of digital revisionism.
Jeremy Wade Morris, University of Wisconsin-Madison, jwmorris2@wisc.edu
Jeremy Morris is associate professor of Media and Cultural Studies in the Department
of Communication Arts at the University of Wisconsin-Madison. He is interested in the
digitization of culture and is the author of Selling Digital Music, Formatting Culture
and co-editor of Appified: Culture in the Age of Apps. He is also the founder of
podcastre.org, a research
database of podcasts that preserves over 2 million audio files.
Alliance of Digital Humanities Organizations; Association for Computers and the Humanities. DHQ 000519, volume 015, issue 1, 05 March 2021. Article.
Over the past decade, podcasting has grown into one of the most important media forms in
the world. This article argues that podcasting’s unique technical affordances —
particularly RSS feeds and user-entered metadata — open up productive methods for
exploring the cultural practices and meanings of the medium. We share three different
methods for studying RSS feeds and podcast metadata: 1) visualizing how topics and
keywords trend over time; 2) visualizing networks of commonly associated keywords entered
by podcasters; and 3) analyzing norms and common practices for the duration of podcasts
(as a time-based media format, podcasting is unusual in that it is not bound by the
programming schedules and technical limitations that provide strict parameters for most
audiovisual forms). The methods and preliminary results reveal how metadata can function
as a surrogate for studying large collections of time-based media objects. Yet our study
also shows that, when it comes to born digital media, the metadata are never fully
separate from the objects they describe. We argue that future work in AV in DH needs to
distinguish between methods best suited for digitized media collections and those
most appropriate for born digital media collections.
Introduction
As a cultural form, podcasting resists easy definition. It is a highly porous medium,
traveling with us over earbuds, phone speakers, and car stereos, accompanying us on
commutes, jogs, errands, and road trips. It’s a sound-based medium that we also experience
visually through live shows, thumbnail icons, and t-shirts that say “Friend of the Pod” or
“Night Vale Community College.” Despite these definitional challenges, the medium, by most
measures, is booming, with the quantity of podcasts, listeners, advertising revenue, and
non-profit funding increasing sharply year after year, including an explosive 2018, which
saw the share of people in the U.S. over the age of twelve who have ever listened to a
podcast climb above 50% for the first time.
If we consider podcasts from a purely technical standpoint, it is possible to narrow the
definition slightly. As we have elaborated elsewhere, a podcast can be defined as a
collection of downloadable files, of any format, served, with accompanying metadata, via
an open updatable internet feed, primarily RSS. An XML-based
protocol, RSS allows podcasters to easily publish their completed work and distribute
it to audiences, who can opt to subscribe to particular feeds. In many ways, the metadata
and the open feed are what separate a podcast from other media files on the internet,
including other forms of on-demand audio (for example, music streaming platforms and audio
book companies). Because RSS feeds are open, podcasting is platform-independent. Listeners
can subscribe to feeds through a number of different podcatching apps and a variety of
platforms. At least for now.
Both the expansive cultural meanings of podcasting and the rigid technical definition
have animated our work over the past three years on the PodcastRE database and our desire
to study and preserve this emerging format. Based at the University of Wisconsin-Madison
and supported by grants from the university and the NEH, PodcastRE (short for Podcast
Research and accessible at http://podcastre.org)
is a data preservation and research initiative. As we write in April 2020, the PodcastRE
database has grown to over 2.5 million podcast episodes from over 16,000 unique RSS feeds
which occupy 99 terabytes of space within our RAID storage array. The collection has
expanded beyond what any individual could listen to within a lifetime, and it only keeps
growing.
What can researchers do with millions of podcast episodes and their associated metadata?
This article seeks to contribute to the body of digital humanities scholarship invested in
harnessing the affordances of digital technology to investigate cultural data at a large
scale. We argue that podcasting’s unique technical affordances
(e.g. RSS and metadata) open up productive methods for exploring the cultural practices
and meanings of the medium. These methods, in turn, hold broader relevance for scholars
seeking to integrate media studies with computational analysis (or, as the theme of this
special issue nicely puts it, “AV in DH”). Our study of
podcasting shows the ways that metadata can function as a surrogate for studying large
collections of time-based media objects. To put it simply, it’s far easier to query 2
million metadata records than it is to query 2 million media files of movies, TV episodes,
or audio programs. Yet our study also shows that, when it comes to born digital media, the
metadata are never fully separate from the objects they describe, nor can they fully
describe, or replace the need for, returning to the media themselves during the final
analysis. As a result, future work in AV in DH needs to distinguish thoughtfully between
methods best suited for digitized media collections and those most
appropriate for born digital media collections.
In this article, we share three different methods for analyzing the metadata of
PodcastRE’s born digital corpus, assessing the strengths and weaknesses of each method and
sharing preliminary results. First, as we will share, PodcastRE’s Term Frequency Line
Graph (http://podcastre.org/lineGraph)
allows researchers to create visualizations of trending topics and keywords. Interpreting
the results of the line graphs can be challenging, however, due to the messiness of the
underlying metadata and the problem of normalizing a rapidly growing
corpus and medium. Second, PodcastRE’s Associated Keyword Cloud visualization tool
(available at http://podcastre.org/wordCloud) enables researchers to query a keyword and
generate a word cloud that displays the other keywords that appear alongside that keyword
in podcasts. We argue that this data visualization harnesses the potential of the medium’s
inconsistent and messy metadata and allows for open-ended explorations, serendipitous
discovery, and new questions about the agency of podcasters in self-defining their
cultural output and connecting it with particular communities and conversations. Third and
finally, we share approaches for studying the duration of podcasts. As a time-based media
format, podcasting is unusual in that it is not bound by the programming schedules and
technical limitations that provide strict parameters for most audiovisual forms, such as
movies, television, and radio. If a podcast can run anywhere from a couple of seconds to
several hours in length, how do norms and common practices develop that establish optimal
models for a podcast’s duration? To investigate these questions, we exported CSVs from the
database (using a mediaFileDuration field generated by the individual
episode files), sorted them into meaningful sample groups, and investigated the data for
patterns.
Ultimately, our goal for this article is to share research and methods for studying the
explosion of audio culture taking place in podcasting and through the sonic communities
and conversations podcasting draws together. These methods are especially well suited for
studying audio, but they would also be valuable for exploring online video collections and
other digital media objects. As in any research study, though, it’s important to address
the specific before offering broader generalizations. With that in mind, we would like to
now turn to a consideration of the history and design of RSS feeds, how they inform the
underlying dataset (the PodcastRE collection), and the affordances and challenges of these
structures.
RSS Feeds, the PodcastRE Collection, and Working with Messy Metadata
It is fitting that there is no consensus on what the initials RSS stand for, given that
the protocol has enabled inconsistent and idiosyncratic podcast metadata to proliferate
across the internet. Really Simple Syndication is the most commonly cited meaning, but
Rich Site Summary and Resource Description Framework (RDF) Site Summary have also been
cited as the basis of the name. There is no question, however, that the technology has
played a pivotal role in the growth of podcasting and in the infrastructure of PodcastRE.
On March 15, 1999, Netscape published the first specification for RSS. Based on XML and
developed by Ramanathan Guha and Dan Libby, RSS was created so that the Netscape home page
My Netscape could be refreshed with new content from webpages that used the specification.
Over the next couple of years, RSS went through multiple iterations, and then on
December 25, 2000, Dave Winer and UserLand Software released RSS 0.92. This version is the
most important in the history of podcasting, as it was the first to include the
<enclosure> tag, which allowed for the attachment of media files. Winer concocted the tag,
with strong prompting from Adam Curry, as a way to deliver high quality multimedia files
over the internet without the quality and wait time issues that plagued early streaming.
The first use of an <enclosure> was to distribute a set of Grateful Dead MP3 files,
presaging its dominant use in the years to come.
RSS continued to develop for the next decade and mostly stabilized as a specification
with version 2.0.11 on March 30, 2009. Since then, the only updates to the format have
come in the form of XML namespaces, a way of adding outside-of-specification elements to
XML documents, which Apple, Google, and other podcast distributors commonly use to expand
metadata options for commercial purposes. Even as podcasting apps, playback technologies,
and the on-demand sound industry have changed throughout the 2010s, the basic structure and
syntax of RSS have remained constant, keeping the circulation of most podcasts relatively
open and freely downloadable.
The open infrastructure of RSS also became foundational for our work on PodcastRE. Interested in studying podcasts, but worried about the
vulnerability of digital audio files, we realized in early 2014 that there were few
searchable databases of podcasts for studying and analyzing the booming audio culture
taking place in podcasting. We began rather humbly by logging RSS feeds manually in iTunes
and downloading audio files to a local hard drive, tracking as best we could podcasts that
were being cited in the press as part of the renewed interest in podcasting, like
Serial or Welcome to Night Vale. As the project grew, we implemented a more coherent collection
process, and since 2018, we have been saving podcasts included in discussions of
podcasting’s “golden age” as well as interrogating which podcasts are
being left out of that discussion. We’ve navigated the need to preserve the
“popular” by automating the collection of a particular index of
what’s popular: the Apple Podcasts top 100 lists for the U.S., Great Britain, France, and
Australia every 24 hours. This automated approach toward collecting embraces both the
affordances of the digital media and the MPLP (More Product, Less Process) model proposed
by Mark A. Greene and Dennis Meissner. Our efforts to identify
and collect significant podcasts beyond the Apple Podcasts top 100 have been driven by
collaborations with scholars who are researching independent podcasts produced by women,
indigenous peoples, and people of color and by following the work of other networks, directories and
databases devoted to highlighting marginalized/less visible podcasts (Podcasts in Color,
Women In Podcasting, PotLuck Podcast Network, etc.).
PodcastRE’s collection of 2 million podcast episodes has thus been built by a combination
of algorithmic methods and informed hand-selections. There’s also a “submit a podcast”
feature on the project’s website that allows individuals to
add the RSS feeds for podcasts they’d like preserved. The 16,000 archived podcast feeds
are a fraction of the over 1,000,000 podcast feeds that, according to estimates, are
currently being distributed. But the PodcastRE collection
does offer a valuable and diverse cross-section of English-language podcasts from the past
several years.
The common thread through all of this work has been RSS. To put it simply, if a podcast
doesn’t have an RSS feed, then we cannot yet preserve it within our system. This is one of
the reasons why the technical definition of a podcast — an open feed of downloadable files
and associated metadata — has been so important to our work on PodcastRE. To achieve our
goals of scale, not only did we need to be able to download podcast episode files
automatically, we also needed to gather the metadata we could store automatically. For
PodcastRE, the elements available through the RSS specification, and its associated
namespaces, are as important as the podcast episode files themselves. RSS, in other words,
defines the possible universe of metadata for the podcasts archived in PodcastRE.
What we did not immediately appreciate was how messy, idiosyncratic, and incomplete the
world of podcasting metadata would prove to be. Podcast RSS metadata is a world away from
the familiar and relatively consistent metadata fields of TEI and Dublin Core. One reason
is the relatively sparse number of elements which are required for a feed to be valid. In
fact, an RSS feed only needs four elements to be present in order to be valid: the
<channel> parent element with associated <title>, <link>, and
<description> elements. This would be a feed without content, though, as it would
contain no <item> elements. Because authors fully
manage their own RSS feeds, and the entry of the metadata into them, they are directly
responsible for the depth and quality of the metadata. This aspect of podcast metadata
cannot be stressed too highly. With the exception of a few elements like
<googleplay:category> and <itunes:type>, there are almost no constraints on
what podcast authors put into the various elements. Even fixed-format elements like
<pubDate>, which would seem self-evidently to mean the date on which a podcast
episode was published into a feed, can end up being used by authors to mean something very
different. For example, the metadata for The Reith Lectures podcast from the BBC contains
many <pubDate> values from before 1950, long before the term podcast was
ever coined. The series instead uses <pubDate> to mean the day the lecture was
originally given. RSS authors continue to have the authority to change anything they
wish — including something as fundamental as the title of an episode, or even their whole
podcast, at any time. Looking in PodcastRE, we see examples related to branding, as when
Bookworm added their network and became KCRW's Bookworm; or to
SEO, as when Highest Self Podcast added some terms and turned into
Highest Self Podcast: Modern Spirituality, Ayurveda, Conscious Entrepreneurship,
Mind-Body Balance.
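To make the sparseness of these requirements concrete, the following Python sketch parses a small feed and checks only for the channel elements named above. The feed itself is invented for illustration (its URLs and titles are hypothetical), and the helper functions are not part of PodcastRE; the sample <pubDate> is deliberately set decades before podcasting existed to show that such a value is perfectly well-formed.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed: the minimal channel described above, plus one optional
# <item> carrying an <enclosure> that points at an audio file.
FEED_XML = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Lectures</title>
    <link>https://example.org/lectures</link>
    <description>A made-up feed used only for illustration.</description>
    <item>
      <title>Lecture One</title>
      <pubDate>Sun, 26 Dec 1948 21:00:00 GMT</pubDate>
      <enclosure url="https://example.org/audio/one.mp3" length="12345678" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""

REQUIRED_CHANNEL_ELEMENTS = ("title", "link", "description")


def missing_channel_elements(xml_text):
    """Return the required channel children that are absent or empty."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        return ["channel"]
    return [name for name in REQUIRED_CHANNEL_ELEMENTS
            if not (channel.findtext(name) or "").strip()]


def item_pub_years(xml_text):
    """Return the year of each item's <pubDate>, or None when it cannot be parsed."""
    years = []
    for item in ET.fromstring(xml_text).iter("item"):
        raw = item.findtext("pubDate")
        try:
            years.append(parsedate_to_datetime(raw).year)
        except (TypeError, ValueError):
            years.append(None)
    return years


if __name__ == "__main__":
    print(missing_channel_elements(FEED_XML))
    print(item_pub_years(FEED_XML))
```

Nothing in the parsing step can tell whether a date records when an episode entered the feed or when a lecture was first delivered; that distinction lives only in the author's intent.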
The inconsistent and incomplete metadata records created major challenges for our efforts
to systematically preserve podcasts and make them easily searchable. We found it
especially unfortunate that metadata fields that could have been revelatory for search
faceting and social network analysis (fields such as <network>, <host>, and
<contributor>) are not a part of any current podcast RSS specifications. Yet it was
equally clear that authoritative approaches to metadata had their own problems and major
blind spots. The inadequacies and biases of Library of Congress subject headings have
received considerable attention within the discipline of information studies. For example,
Juliet L. Hardesty has argued that the subject headings
generally take the primacy of white men as a default;
Robert
Frost is cataloged under Poets, American without
reference to gender or race, whereas Maya Angelou is listed under subjects including
African American women authors and African American authors. The catalogers, in these cases, are applying a schema
that upholds a white patriarchal worldview and minimizes both the needs of users and the
ways in which creators and subjects would choose to define themselves.
In contrast, podcast creators have a tremendous amount of agency in how they define
themselves and attempt to connect with users (i.e. listeners and audiences). When the
creators of the PHX podcast entered the keywords podsincolor and
women of color within their RSS feed, they actively chose to
present themselves this way and place their work within a larger network of podcasts
produced by people of color. The flexibility that characterizes metadata practices proves
to be critical for marginalized podcasters in forming community, as they seek to carve out
space for themselves within media production practices and platforms that consistently
privilege hegemonic whiteness, accepted paradigms of masculinity, and heteronormativity.
This does not mean that podcasters never police their own metadata in response to the
asymmetrical power dynamics between platform and creator. Still, the flexibility provides
another avenue through which marginalized communities can stand in opposition to the
individualistic neoliberal ideologies that undergird contemporary user-driven media
production. It is critical to note, however, that the non-uniformity of metadata
production yields ambivalent practices: innovative podcasters can resist the influence of
various dominant ideologies, while others use the same space to reinforce their own
centrality.
For example, there are also many instances of podcasters stuffing their RSS feeds with
keywords in order to make their podcasts prominent within content aggregators and
podcatchers. The internet abounds with advice and speculation about search engine
optimization and strategies for drawing attention to one’s content, such as the optimal
number of keywords, the kinds of thumbnail images that should be connected to content, and
more. The opaque nature of how Apple Podcasts organizes its search
results impacts the manner in which metadata is written, and this influences the ways that
podcast creators self-define their own content. The dominating influence of Apple Podcasts
categories can be observed by the fact that within the entirety of the PodcastRE database,
the most used keywords lists are dominated by terms that are outlined either fully or in
part by Apple Podcasts genre specifications. For example, with the exception of the words
podcast and radio, the top fifteen keywords
at the podcast level (describing the entire podcast series) all conform to
genre classifications within Apple Podcasts. Similar patterns can be seen at the episode
level where, with the exception of
Talk Radio, Podcast, and a blank space/uncategorized, the top ten keywords
reflect Apple Podcasts categories. The large amount of uncategorized keyword terms may
gesture towards the fact that after 2013, the keywords metadata field became deprecated,
meaning that it no longer affected the output of Apple Podcasts’ search engine algorithms
(https://support.libsyn.com/kb/the-rss-feed/). After this discovery, many
podcasters may have forgone the labor of adding keywords, as the fields that most
influence search engine optimization are now the title, author, and description tags.
Even though Apple Podcasts deprecated keywords within its search algorithm, we became
excited about the role keywords could play for our work on PodcastRE. What sort of data
visualizations and discoveries might be possible by harnessing RSS metadata at scale?
Ultimately, we built two data visualizations for the site. Perhaps not surprisingly, the
more successful of the two was the one that most embraced the idiosyncratic, messy, and
user-created nature of RSS.
Graphing Metadata Term Frequency Across Time
How do keywords and other fields used to describe podcasts change over time? Could
tracking these changes prove useful for spotting trending topics within the podcasting
ecosystem? To explore these questions, we created PodcastRE’s Term Frequency Line Graph
(publicly available at https://podcastre.org/lineGraph), which tracks the frequency across time that any
word or phrase within the metadata appears. The fields searched include the title,
creator, synopses, and keywords. A visualization graphing the term
money within PodcastRE is displayed below. If a user clicks on any
point within the graph, their browser opens a new tab displaying all of the
podcast episodes from that month or year that match the search term.
The Term Frequency Line Graph searches metadata included within individual podcast
episodes and across the entire feeds (for example, while NBA may be a
keyword that describes a podcast feed as a whole, China may be a
keyword that describes a topic discussed within one episode of the podcast). By default,
the X-axis of the graph is divided by years; however, users can toggle to a monthly scale.
This allows for researchers to see when certain topics or keyword phrases spike on a
seasonal cycle (for example, baseball consistently has an uptick
during the playoffs every October) versus more macro-scale trends that rise and fall over
a period of years.
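The counting behind such a graph can be sketched as follows. This is a minimal Python illustration rather than the site's implementation: the records and field names are hypothetical, and matching is reduced to a case-insensitive substring test across the searched fields.

```python
from collections import Counter
from datetime import date

# Hypothetical records; the field names are illustrative, not PodcastRE's schema.
EPISODES = [
    {"published": date(2017, 10, 3), "title": "Playoff preview",
     "keywords": ["baseball", "sports"]},
    {"published": date(2017, 10, 20), "title": "World Series baseball talk",
     "keywords": []},
    {"published": date(2018, 4, 1), "title": "Opening day",
     "synopsis": "Baseball is back", "keywords": ["mlb"]},
    {"published": date(2018, 6, 10), "title": "Money matters",
     "keywords": ["finance"]},
]

SEARCHED_FIELDS = ("title", "creator", "synopsis", "keywords")


def matches(record, term):
    """True if the term appears (case-insensitively) in any searched field."""
    term = term.lower()
    for field in SEARCHED_FIELDS:
        value = record.get(field, "")
        text = " ".join(value) if isinstance(value, list) else value
        if term in text.lower():
            return True
    return False


def term_frequency(records, term, by="year"):
    """Count matching records per year, or per month when by='month'."""
    counts = Counter()
    for record in records:
        if matches(record, term):
            published = record["published"]
            if by == "year":
                key = f"{published.year}"
            else:
                key = f"{published.year}-{published.month:02d}"
            counts[key] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(term_frequency(EPISODES, "baseball"))              # yearly scale
    print(term_frequency(EPISODES, "baseball", by="month"))  # monthly scale
```

Switching the grouping key from year to month is all that separates the two scales, which is why a seasonal spike is invisible in the yearly view.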
When researchers use the Term Frequency Line Graph to look for trends across a span of
years, however, they quickly encounter an interpretive challenge: almost any term they
search will appear to dramatically increase in 2017 and 2018. This is because the
PodcastRE collection grew exponentially over those two years, a result of the growth in
the podcasting ecosystem as a whole and our own curatorial decision to automatically
preserve any feed that appears on the Apple Podcasts Top 100 chart in the U.S., U.K.,
Australia, or France. While we give users the ability to normalize
the graph results (using an equation that accounts for the larger number of podcasts
archived from some years than from others), we know this feature has its limits. What does it
mean to normalize the number of podcasts during a period in which the
medium is rapidly evolving?
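One common way to normalize counts of this kind is to express them as a rate per archived episode. The Python sketch below assumes that approach purely for illustration; the equation PodcastRE actually uses is not specified here, and the figures are invented.

```python
def normalize(match_counts, total_counts, per=1000):
    """Express raw matches as matches per `per` archived episodes in each period.

    Assumption: normalization divides by the number of episodes archived in the
    same period. Periods with no archived episodes are reported as 0.0.
    """
    normalized = {}
    for period, matched in match_counts.items():
        total = total_counts.get(period, 0)
        normalized[period] = (matched * per / total) if total else 0.0
    return normalized


if __name__ == "__main__":
    # Invented figures: raw matches grow eightfold while the archive grows tenfold.
    raw = {"2016": 50, "2018": 400}
    archived = {"2016": 10_000, "2018": 100_000}
    print(normalize(raw, archived))
```

In this invented case the raw count rises sharply while the rate per thousand episodes falls slightly, which is exactly the kind of divergence that makes the normalized graph hard to interpret when the archive and the medium are both growing quickly.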
We have tried to address this interpretive challenge through a Rate of Episodes Added
button, which provides contextualization in regard to the
database itself. By showing how many episodes are added per year, users can see how the
rate of growth in the database can affect the numbers that are being shown for any query’s
term frequency. Additionally, the Area Graph button
transforms the data into a stacked graph, which allows for comparisons across multiple
queries at particular moments in time and reminds users that the graphs are malleable and
dynamic. Finally, the user can move to a more granular level at any point by clicking on a
point in the graph, allowing them to investigate the actual podcast feeds and episodes
that appear as abstractions within the graph. Users can save the data as a CSV file or as
a JPG, PNG, or SVG image, so that it can be used in a variety of
presentational contexts.
In many ways, PodcastRE’s Term Frequency Line Graph exemplifies the limitations digital
humanists are likely to encounter when applying data visualizations built for
digitized text collections to born-digital media collections.
We modeled the user-experience and technological framework of PodcastRE’s Term Frequency
Line Graph on that of the Arclight app (http://projectarclight.org), which searches the 2.5-million-page corpus of the
Media History Digital Library (MHDL). The MHDL is composed of
books and magazines pertaining to the histories of film, broadcasting, and recorded sound
from 1915 to 1960, which is an especially robust period for the searching of named
entities (such as people, film titles, or radio station call letters). Additionally, the
normalization function for Arclight graphs works quite well (within 1915-1960, the most
represented year is only double the size of the least represented year, avoiding PodcastRE’s
challenge of grappling with exponential growth). Normalized searches for the names of
movie stars, for example, generally map onto the arcs of their popularity and/or
notoriety, sometimes, though not always, with surprising results. Data visualizations
built for searching entities within large corpora of digitized texts are less adept at
producing immediately legible results for searching the metadata keywords of a rapidly
growing born-digital medium. What would it mean to design a data visualization tool that
embraced the messiness of born-digital objects and their metadata, rather than trying to
smooth them out?
Associated Keyword Word Cloud
In developing PodcastRE’s Associated Keyword Word Cloud, we sought to harness and
foreground the specificities and idiosyncrasies of born digital media collections. This
data visualization takes the keywords that podcasters entered to describe their work and
puts them into conversation with other podcasters’ keywords. A specific example is helpful
for understanding how it works.
Using the keyword money, in a search conducted in the fall of 2019,
we found the term appeared in the metadata of 68,619 podcast episodes saved within
PodcastRE, collected from 587 discrete RSS feeds. The other keywords that appear most
frequently along with money in podcast metadata are visualized below
(see Figure 2). This visualization includes predictable
matches within the popular financial self-help genre (e.g. wealth, business, entrepreneur), as well as meaningful intersections that lie outside
financially-oriented podcasts (e.g. spirituality, Relationships & Sex, Fear). When a user clicks on a keyword in the cloud, the user
is immediately transferred to the PodcastRE database interface, where it shows all of the
podcasts that used these paired keyword values. Figure 3
reveals the results of the podcasts that contain both the keywords
money and spirituality. The process promotes
serendipitous discovery and may lead the researcher toward encounters they hadn’t
anticipated. For example, modern witchcraft is better represented in the podcasts with
money and spirituality as keywords than most
traditional forms of organized religion.
The Associated Keyword Word Cloud visualization was built by connecting multiple
open source technologies. Like the Term Frequency Line Graph, the Associated Keyword Word
Cloud uses the Highcharts JavaScript library to animate the visualization. To retrieve the
information it needs, we query the keyword metadata facet within PodcastRE’s Solr search
index and store the results as key-value pairs, in which each associated keyword is paired
with the number of podcasts that carry both it and the queried keyword (the hit count).
For example, if a user queries the word love within the database, a potential key-value
pair would be [relationships, 163], where relationships is the associated keyword for love
and 163 is the number of times this keyword was added alongside the word love. The results
are sorted from the keywords with the most hits down to those with the fewest. By
targeting this metadata keyword field and setting the weight of each word to the hit count
of its key-value pair, we were able to visually represent which keywords were paired most
often with the queried word, making the word with the heaviest weight the largest in the
word cloud. Because
certain topics have a range of associated keywords that spanned hundreds of words, we
limited the number of keywords that can be shown on the word cloud to a maximum of 200
words. While this decision may hinder researchers from getting the full range of
associated keywords, this limitation was imposed to ensure readability on the
visualization. Two hundred words seemed like a reasonable count in order for researchers
to gain a sense of the wide range of relational topics that podcasters were dealing with,
and simultaneously allow the visualizations to be effective in showing which keywords were
the most actively engaged with.
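The pairing, sorting, and capping described above can be approximated in plain Python. This sketch is not the Solr facet query the site runs: it counts co-occurrences over hypothetical keyword lists held in memory, and it lowercases keywords before comparing them, which is an assumption rather than documented PodcastRE behavior.

```python
from collections import Counter

MAX_CLOUD_WORDS = 200  # the readability cap described above


def associated_keywords(keyword_lists, query, limit=MAX_CLOUD_WORDS):
    """Return (keyword, hit_count) pairs for keywords that co-occur with `query`.

    Each element of `keyword_lists` is the keyword list of one podcast or episode.
    Pairs are sorted from most to fewest hits and truncated to `limit` entries.
    """
    query = query.strip().lower()
    counts = Counter()
    for keywords in keyword_lists:
        # A set ensures a keyword repeated within one record is counted once.
        cleaned = {k.strip().lower() for k in keywords if k.strip()}
        if query in cleaned:
            counts.update(cleaned - {query})
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical keyword lists, one per podcast.
    sample = [
        ["love", "relationships", "dating"],
        ["Love", "relationships"],
        ["money", "wealth"],
        ["love", "relationships", "family"],
    ]
    for word, weight in associated_keywords(sample, "love"):
        print(word, weight)  # weight would drive the word's size in the cloud
```

Each returned pair has the same shape as the [relationships, 163] example: the keyword, followed by the hit count that serves as its weight.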
There are two options on the Associated Keyword Word Cloud interface that aid in
isolating whether the keywords shown are related to podcasts in their entirety or
exclusive to certain episodes. Additionally, if users want a merging of these two levels
of metadata, they can search across both podcast and episode keywords by using the
All Keywords option. In this manner, for podcasts that may deal
with a wide range of topics, such as news or current events podcasts, there can be a
closer examination on a micro episodic level of what kinds of keywords are used to define
certain topical content. Often, the keywords that are used to describe podcasts are not
uniformly applied to define episodes, so providing these two levels of analytical range
gives researchers more flexibility in the kinds of questions they can ask using
PodcastRE.
All attempts to interpret the Associated Keyword Word Clouds ultimately lead back to
reflecting on the practices, norms, aspirations, and communities of the podcasters
themselves.
As discussed earlier, keywords allow content creators to define their work to listeners
and podcatcher applications. They are a space of creator agency, where podcast producers
deploy keywords to create networks of ambient affiliation with other podcasts and subject
matter. By making one of PodcastRE’s database visualization tools intimately connected to
these creator-defined keywords and their relationships to other keywords, we provide an
alternative mode of discoverability apart from the algorithms that govern commercial
aggregators such as Apple Podcasts. In this manner, PodcastRE hopes to provide a different
approach that foregrounds creator agency and their interactions with their own metadata
through the digital archive’s organization, particularly with these metadata
visualizations.
Studying the Durations of Podcasts
The Term Frequency Line Graph and Associated Keyword Word Cloud can both be effectively
applied toward exploratory research and achieving serendipitous discoveries. But we also
wanted to use PodcastRE and the mediaDuration field to examine a more
focused question. What patterns can we notice about the duration of podcasts, and what can
they tell us about practitioner norms and assumptions of what makes for a good length of a
podcast? Unlike most other AV forms — movies, television, and radio — podcasts are a
time-based medium that is not constrained by programming schedules (broadcast schedules,
movie theater showtimes) and technical limitations (reels of film and tape). If a podcast
could run anywhere from a couple of seconds to several hours in length, how do norms and
common practices develop around perceived ideas of a podcast’s optimal duration? We
realized that metadata could help us answer this question.
In this section, then, we propose and share two approaches to studying podcast duration.
First, we consider how duration analysis might clarify the differences between two
programs of the same classification, in this case two popular daily programs
from The New York Times and NPR, using data gathered from
episodes ranging from the former’s launch in early 2017 to an end point of April 2018.
Second, we conduct an investigation of a much larger scale, analyzing large rosters of
programs to juxtapose duration across networks and genres. Our case studies here are the
comedic programs of Earwolf and the comparably more serious fare of Gimlet Media, using
data gathered from episodes ranging from 2009 until early 2018. In both of these cases,
the statistics were gathered by first running an SQL query on the PodcastRE database, then
exporting metadata for all of the episodes into a .csv file, and finally finding averages,
medians, and other numbers using Microsoft Excel. All of these approaches required us to
assemble subsets of data from within the PodcastRE collection (and the .csv files), rather
than treating the entire collection as a dataset. (It should be noted that the ability
to obtain duration data is not currently available to front-end users of the PodcastRE
site.) The genre and network categories that we ourselves added to the
spreadsheets opened the data up for more meaningful analysis, especially when paired with
the duration metadata provided by the RSS feeds.
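The summary step of that workflow (averages, medians, and ranges per program) could equally be scripted. The sketch below is only an illustration of the idea, not the procedure we used, which relied on Microsoft Excel: the column names `program` and `duration_seconds` and the sample rows are hypothetical and do not reflect the actual export format.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_durations(csv_text):
    """Group episode durations (in seconds) by program and summarize them."""
    by_program = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Column names are assumptions made for this example.
        by_program[row["program"]].append(int(row["duration_seconds"]))
    summary = {}
    for program, durations in by_program.items():
        summary[program] = {
            "episodes": len(durations),
            "mean": statistics.mean(durations),
            "median": statistics.median(durations),
            "shortest": min(durations),
            "longest": max(durations),
            "range": max(durations) - min(durations),
        }
    return summary


# Invented rows standing in for an exported .csv file.
SAMPLE_CSV = """program,duration_seconds
Show A,600
Show A,720
Show A,900
Show B,1500
Show B,2100
"""
summary = summarize_durations(SAMPLE_CSV)
```

Additional columns, such as the genre and network categories added by hand, could be used as grouping keys in the same way.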
Our first approach to studying duration explored what has become one of the most popular
contemporary podcast formats: the daily news program. How long should a daily news podcast
take to consume? When The New York Times launched The Daily in
February of 2017, host Michael Barbaro described the fledgling program thusly: “This is how the news should sound. Fifteen minutes a day, five days a
week. It isn’t quite a podcast – although you can listen wherever you listen to
podcasts. It isn’t quite the radio – although the mechanics are largely the same. It
isn’t quite the newspaper – although we’ll be drawing heavily on the journalism that
powers The New York Times.” Though Barbaro pegged the program as difficult to explain,
it was a nearly immediate hit, gaining over five million monthly listeners by July of 2018.
As Barbaro told Vanity Fair that same month,
“When we started the show, we had many goals… We didn’t realize we
were going to make money that was actually going to get pumped back into the
company.” Yet as is often the case, success breeds imitators and
competitors, and The Daily witnessed the rise of its biggest challenger in
June of 2017 when NPR launched Up First, “a daily 10-minute morning news podcast that is designed with
digital listeners in mind but will also serve as a preview of the news stories that will
be treated in depth on public radio stations across the country throughout the
day.” That program was also a swift triumph, and as of October 2018,
both The Daily and Up First sat comfortably in the top five most
popular podcasts according to Podtrac’s rankings: the former trailed only
Serial, while the latter occupied the fifth spot.
Episode duration has been a central selling point for each of the two podcasts. As seen
above, both of the series’ launch press releases mention episode length. Descriptions of
the programs on their official websites also focus on duration. Up First has
remained consistent in its advertised average runtime: “NPR’s Up
First is the news you need to start your day. The biggest stories and ideas –
from politics to pop culture – in 10 minutes.” The Daily, meanwhile, has added five minutes to its
initial announcement: “This is how the news should sound. Twenty
minutes a day, five days a week, hosted by Michael Barbaro and powered by New York Times
journalism.” The programs are, essentially, two different approaches to the
morning commute: Up First’s proposed shorter length seems guaranteed to slot
into almost any daily trip to work, while The Daily’s longer runtime requires
either a lengthy commute, multiple listening sessions, or even perhaps the utilization of
1.5x or 2x speed playback options. Such duration decisions are complemented by
storytelling approaches: Up First’s short length is matched with a
“greatest hits”-style compilation of short stories, while The
Daily’s relatively lengthier duration is primarily spent on the discussion of a
single story. In theory, then, the former aims to quickly provide its listeners with
headline-style blurbs about the day’s biggest stories, while the latter seeks to
exhaustively cover a single topic.
Such temporal differences are roughly borne out by the metadata found in PodcastRE’s
database, although the story is more complicated than the descriptions of the series
imply. The Daily (mean duration of 22:51, median duration of 22:12) runs
nearly ten minutes longer than Up First (mean duration of 13:33, median
duration of 13:17), with both programs on average running a few minutes longer than their
advertised lengths. The difference between the two series is much starker when
considering the range in podcast durations, as Up First is relatively
consistent in episode duration (shortest episode of 11:01 and longest episode of 17:46 for
a range of 6:45) while The Daily varies widely between episodes (shortest
episode of 13:00 and longest episode of 41:23 for a range of 28:23). These durational
differences align neatly with the programs’ content choices (i.e. multiple headlines vs.
single story focus), though they provide critical additional clarifications. While both
Up First and The Daily release episodes each weekday morning,
the former’s tight range and shorter length tie it more closely to its proposed function
as morning commute listening, while the latter’s wider range and extended runtime
emphasize delivering a full story adequately. Since podcasts do not have the same
durational constraints as broadcast media, these choices in runtime are clear aesthetic
and storytelling decisions – yet given the evolutionary radio approach of NPR’s daily
podcast and The New York Times’s commitment to the news story, these
decisions are not completely detached from their companies’ original mediums.
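The arithmetic behind these comparisons is simple to reproduce. The sketch below, offered only as an illustration, converts the clock-style durations reported above into seconds and recomputes the ranges and the gap between the two means; the helper names `to_seconds` and `to_clock` are our own inventions for this example.

```python
def to_seconds(clock):
    """Convert 'mm:ss' or 'h:mm:ss' into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def to_clock(seconds):
    """Format seconds as 'm:ss', or 'h:mm:ss' when an hour or longer."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


# Shortest and longest episode lengths as reported in the text.
reported = {
    "Up First": ("11:01", "17:46"),
    "The Daily": ("13:00", "41:23"),
}
ranges = {
    show: to_clock(to_seconds(longest) - to_seconds(shortest))
    for show, (shortest, longest) in reported.items()
}

# Gap between the two mean durations reported in the text.
mean_gap = to_clock(to_seconds("22:51") - to_seconds("13:33"))
```

Here `ranges` comes out to 6:45 for Up First and 28:23 for The Daily, and `mean_gap` to 9:18, matching the "nearly ten minutes" noted above.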
On a larger scale, podcast duration analysis can point towards divergent approaches by
podcast networks and in certain genres. As an example, we conducted an analysis of thirty
Earwolf programs (Affirmation Nation, Analyze Phish,
Andy Daly Podcast Pilot Project, The Apple Sisters,
Bitch Sesh, Comedy Bang! Bang!, Eardrop,
Earwolf Challenge, Fogelnest Files, Glitter in the
Garbage, Hard Nation, Hello From The Magic Tavern,
Hollywood Handbook, How Did This Get Made,
Improv4Humans, Kevin Pollak’s Chat Show, Mike
Detective, Never Not Funny, Off Book, Professor
Blastoff, Rafflecast, Ronna and Beverly,
Spontaneanation, Throwing Shade, Topics,
Totally Laime, U Talkin’ U2 2 Me, Who Charted,
With Special Guest Lauren Lapkus, and Womp It Up!) and
nineteen Gimlet Media programs (Chompers, Crimetown,
Every Little Thing, The Habitat, Heavyweight,
Homecoming, Mogul, Mystery Show, The
Nod, The Pitch, Reply All, Sampler,
Sandra, Science Vs., StartUp,
Surprisingly Awesome, Twice Removed, Uncivil,
and Undone). The former describes itself as “the
leading comedy podcast network devoted to creating the best, funniest, and most
entertaining podcast shows in existence.” Gimlet Media specializes in more
serious fare, characterizing itself as “the
award-winning narrative podcasting company that aims to help listeners better understand
the world and each other.” Though both companies employ personnel who have worked or
continue to work in other mediums, Earwolf and Gimlet distinguish themselves from other
major podcast networks such as
NPR, iHeartRadio, and WNYC Studios through their
podcast nativism: both companies began as strictly podcast-focused networks rather than
emerging within older media companies.
Perhaps as a result, the two networks have markedly different approaches towards podcast
episode length. Of the 30 surveyed Earwolf programs, 3 have average runtimes between 0-20
minutes, 4 have average runtimes between 20-40 minutes, 6 have average runtimes between
40-60 minutes, 10 have average runtimes between 60-80 minutes, 3 have average runtimes
between 80-100 minutes, 3 have average runtimes between 100-120 minutes, and 1 has an
average runtime between 120-140 minutes (See Figure 4). This
means that over half of the surveyed programs have average episode durations over an hour,
with programs ranging from Eardrop’s average runtime of 3:17 to Never
Not Funny’s average runtime of 2:05:33. The shortest single episode was a
38-second Eardrop episode, while the longest individual episode was a
Comedy Bang! Bang! that lasted 3:19:02. Earwolf’s individual shows also
frequently feature drastic ranges in shortest and longest duration: Hollywood
Handbook, for instance, has a range of 1:53:27 between its shortest and longest
episodes, while Comedy Bang! Bang!’s range is 2:39:06.
Gimlet, on the other hand, is much more consistent in its runtimes across series, though
there is still variation between individual episodes. Of the 19 surveyed programs, only 1
had an average runtime between 0-20 minutes, and that program (Chompers)
serves a specific and brief function: children are meant to listen to the series as they
brush their teeth. Two of the series had an average runtime between 40-60 minutes, though
both of those shows (Twice Removed and Mystery Show) are no
longer producing episodes. The other 16 series, then, had average runtimes between 20-40
minutes, aligning Gimlet’s roster with conventional advice on podcast episode
length. (Though most blogs on the subject recommend tying duration to whatever
length your content demands, they also routinely recommend shorter average durations,
with We Edit Podcasts, for instance, writing, “it is
possible to become successful with a longer show, but in general, the 22 minute rule
trumps all.”) While individual episodes still varied quite a bit
(Reply All, for instance, had a range of 1:35:29 between its shortest and
longest episodes, while Mogul had a range of 1:16:17), these ranges were
still much smaller than the largest Earwolf ranges.
The relative homogenization of Gimlet Media podcast duration, then, stands in stark
contrast to the diverse podcast lengths of Earwolf. Every Gimlet Media podcast had an
average run time under an hour, and 84.2% of the shows surveyed had average runtimes
between 20-40 minutes. 56.7% of the Earwolf podcasts analyzed, meanwhile, had average
runtimes over an hour long, and 76.7% of Earwolf’s podcasts had average runtimes over 40
minutes long — in other words, over three-quarters of Earwolf’s shows ran longer on
average than Gimlet’s sweet spot. Individual episode lengths varied
in each of the networks’ programs, but Gimlet’s programs featured smaller ranges than the
large variation found in many of Earwolf’s programs. Such differences may be the result of
institutional decisions, generic divergences, or series lengths. Whatever the cause,
however, Earwolf and Gimlet serve as evidence that podcast networks can have wildly
divergent approaches towards episode duration, and that studying duration can lead us to
insights about genre conventions, production values and more.
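The percentages cited in this comparison follow directly from the bucket counts reported above, as the following sketch shows. It is an illustrative check only. The Earwolf counts are taken from the text; the Gimlet count for the 20-40 minute band is taken as 16, the figure implied by nineteen programs minus the three that fall outside that band and by the reported 84.2%. How a program sitting exactly on a bucket boundary would be assigned is an assumption.

```python
# Number of programs whose average runtime falls in each 20-minute bucket.
# Keys are each bucket's lower bound in minutes.
earwolf = {0: 3, 20: 4, 40: 6, 60: 10, 80: 3, 100: 3, 120: 1}
gimlet = {0: 1, 20: 16, 40: 2}


def share_at_or_above(buckets, lower_bound):
    """Percentage of programs in buckets starting at `lower_bound` or later."""
    total = sum(buckets.values())
    matching = sum(n for start, n in buckets.items() if start >= lower_bound)
    return round(100 * matching / total, 1)


def share_in_bucket(buckets, start):
    """Percentage of programs in the single bucket starting at `start`."""
    return round(100 * buckets.get(start, 0) / sum(buckets.values()), 1)


earwolf_over_hour = share_at_or_above(earwolf, 60)  # 17 of 30 programs
earwolf_over_40 = share_at_or_above(earwolf, 40)    # 23 of 30 programs
gimlet_20_to_40 = share_in_bucket(gimlet, 20)       # 16 of 19 programs
```

These evaluate to 56.7, 76.7, and 84.2 respectively, the same figures given in the paragraph above.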
On a recent episode of the
StartUp podcast, the show’s
host and Gimlet CEO, Alex Blumberg, reflected on his decision to sell Gimlet Media to
Spotify. He noted that Gimlet’s gambit to standardize the production of highly edited and
tightly produced “quality” podcasts (that often followed very specific duration and other
editorial decisions) had turned out to be a financially unfeasible strategy that was
losing ground to cheaper and more popular “chat cast”-style podcasts (where duration and
other attributes are more flexible given the lower costs involved for editing and
polishing the finished piece). His comments are a reminder that, despite the format’s
substantial growth in the last two decades, there are still many lingering questions about
the forms, conventions and economics of podcasting. We believe it is especially crucial
during this time of flux, before podcasting stabilizes like so many other media have, to
study the different approaches podcasters of all types are taking as they experiment with
this emerging sonic format. Although duration numbers seem like relatively innocuous or
descriptive metadata, the research from PodcastRE suggests they reveal historical
relationships between new and old media formats, industrial and economic assumptions about
ideal formats, and generic conventions that shape both amateur and
professional podcasts.
Conclusion
Our work on PodcastRE has aimed to provide tools and data that account for podcasting’s
complexity as a cultural form while simultaneously taking advantage of its unique
technical affordances. The centering of RSS metadata and what can be mined from it through
advanced search, graphing keywords over time, or visualizing word clouds of associated
keywords has helped us facilitate the automated collection of a significant corpus of
podcasts from a crucial period in the format’s emergence. It has also facilitated novel,
fine-grained exploration of specific file characteristics as well, like duration metrics,
across a variety of genres and shows.
The reliance on RSS, however, has also forced us to confront the messiness and
intricacies of a born digital object whose metadata and descriptive features are dynamic
and podcaster generated. Podcasting’s relatively open and accessible origins have helped
create a vibrant environment for web-based audio – one that includes the scores of
podcasts available and the multiplicity of voices behind them, but also the numerous apps,
aggregator sites and distribution technologies that have emerged to support podcasting’s
rise. RSS and XML have not only been important to our work on PodcastRE, but to podcasting
more broadly, and to the agency and control it has provided podcasters for defining their
work on their own terms as well as for listeners in terms of defining their listening
practices. Recently, there have been a number of attempts to move away from the more open
and accessible versions of podcasting, to more closed and profitable models (e.g.
exclusive shows tied to one platform, like Spotify, or subscription-based services like
Luminary). While these options may make podcasting more user-friendly and convenient, or
may offer podcasters more options for monetizing their work, they also make podcasts more
platform-dependent, less analyzable, and less open to research.
The centrality of RSS to both podcasting and PodcastRE has been a theme throughout this
article. We believe our methods and findings, however, hold relevance beyond podcasting for scholars
researching other topics at the intersection of media studies and DH. As this study has
shown, metadata records can serve as surrogates for studying large collections of
time-based media objects, allowing researchers to query the durations of millions of media
objects in a fraction of the time it would take to ingest and analyze transcoded media
files. Yet our work has also shown that, when it comes to born digital media, the metadata
are never fully separate from the objects they describe, nor are they fully capable of
replacing close listening and other media studies methods. There is a need to delineate
between methods best suited for digitized media collections and those
most appropriate for born digital media collections, and to devise
strategies to blend AV and DH methods. By making these distinctions, we can better apply
DH to AV and identify change and continuity, at a large scale, across media history.
Adams, D. After ‘Serial,’ Sponsors Pour Money into Podcasts, The Boston Globe (2015): https://www.bostonglobe.com/business/2015/02/13/after-serial-sponsors-pour-money-into-podcasts/OKAzhUWtqCHQbl3IuEIiBN/story.html, accessed November 26, 2019.
Barbaro, M. Get Ready for The Daily, Your Audio News Report, The New York Times (2017): https://www.nytimes.com/2017/01/30/podcasts/the-daily-get-ready-for-the-daily-your-audio-news-report.html, accessed November 30, 2018.
Bergen, M. Google Brings Podcasting to Play Music, Swinging at Apple’s Dominance, Recode (2015): https://www.recode.net/2015/10/27/11620066/google-brings-podcasting-to-playmusic-swinging-at-apples-dominance, accessed February 23, 2019.
Bray, T., Hollander, D., Layman, A., Tobin, R., & Thompson, H. S. Namespaces in XML 1.0 (Third Edition), W3 (2009): https://www.w3.org/TR/xml-names/, accessed February 23, 2019.
Clement, T. E. Towards a Rationale of Audio-Text, Digital Humanities Quarterly, 10.2 (2016).
Clement, T. E. When Texts of Study Are Audio Files: Digital Tools for Sound Studies in DH. In S. Schreibman, R. Siemens, and J. Unsworth (eds), A New Companion to Digital Humanities, John Wiley & Sons, Ltd., Chichester; Malden, MA (2016): 348-57.
Crowe, Anne. 101 Quick & Actionable SEO Tips That Are HUGE. Search Engine Journal. October 21, 2017. https://www.searchenginejournal.com/101-quick-seo-tips/180563/.
Earwolf, About Earwolf, Earwolf (2018): https://www.earwolf.com/about/, accessed November 30, 2018.
Edison Research, The Infinite Dial 2019, Edison Research (2019): https://www.edisonresearch.com/infinite-dial-2019/.
Florini, S. This Week in Blackness, the George Zimmerman acquittal, and the production of a networked collective identity. New Media & Society 19.3 (2017): 439-454.
Florini, S. and Barner, B. ‘I’m Trying to Be the Rap Oprah’: Combat Jack and the History of the Loudspeaker Network. In J.W. Morris and E. Hoyt (eds), Saving New Sounds: Dispatches from the PodcastRE Project, University of Michigan Press, Ann Arbor (forthcoming 2020).
Gimlet Media, About, Gimlet Media (2018): https://www.gimletmedia.com/about, accessed November 30, 2018.
Greene, M. and Meissner, D. More Product, Less Process: Revamping Traditional Archival Processing, The American Archivist, 68.2 (2005): 208–63.
Hansen, S. The Feed is the Thing: How RSS Defined PodcastRE and Why Podcasts May Need to Move On. In J.W. Morris and E. Hoyt (eds), Saving New Sounds: Dispatches from the PodcastRE Project, University of Michigan Press, Ann Arbor (forthcoming 2020).
Hardesty, J. Bias and Inclusivity in Metadata: Awareness and Approaches. Indiana University Digital Collection Services.
Hines, M. Netscape Broadens Portal Content Strategy, Newsbytes (1999): http://link.galegroup.com/apps/doc/A54120248/ITOF?u=umuser&sid=ITOF&xid=377f45.
Hogan, M. Dykes on Mykes: Podcasting and the Activist Archive. TOPIA: Canadian Journal of Cultural Studies 20 (2008): 199-215.
Hoyt, E., Hughes, K., and Acland, C.R. A Guide to the Arclight Guidebook. In C.R. Acland and E. Hoyt (eds), The Arclight Guidebook to Media History and the Digital Humanities, REFRAME/Project, Falmer (2016): pp. 1-29.
Jerde, S. How NYT’s The Daily Grew to 5 Million Monthly Listeners and Became a Breakout Star, Ad Week (2018): https://www.adweek.com/digital/how-nyts-the-daily-grew-to-5-million-monthly-listeners-and-became-a-breakout-star/, accessed November 30, 2018.
Jockers, M. L. Macroanalysis: Digital methods and literary history. University of Illinois Press, Champaign (2013).
Morris, J. W., Hansen, S., & Hoyt, E. The PodcastRE Project: Curating and Preserving Podcasts (and Their Data), Journal of Radio & Audio Media, 26.1 (2019).
Podcast Insights. 2020 Podcast Stats & Facts (New Research From Apr 2020), Podcast Insights: https://www.podcastinsights.com/podcast-statistics/.
10 SEO Tips For Your Podcast. Podcast Motor. September 22, 2015. https://www.podcastmotor.com/seo-tips-podcast/.
PodNews. The Total Number of Available Podcasts Is Now 700,000, PodNews (2019): https://podnews.net/update/700000.
Podtrac, Podcast Industry Audience Rankings, Podtrac (2018): http://analytics.podtrac.com/industry-rankings/, accessed November 30, 2018.
Pompeo, J. ‘We Didn’t Expect to Make Money’: How The Daily’s Michael Barbaro Unexpectedly Became the Ira Glass of The New York Times, Vanity Fair (2018): https://www.vanityfair.com/news/2018/07/how-the-daily-michael-barbaro-became-the-ira-glass-of-new-york-times, accessed November 30, 2018.
RSS Advisory Board. RSS History. RSS Board (n.d.): http://www.rssboard.org/rss-history, accessed February 23, 2019.
The Daily, The New York Times (2018): https://www.nytimes.com/column/the-daily, accessed November 30, 2018.
Underwood, T. Distant Horizons: Digital Evidence and Literary Change. University of Chicago Press, Chicago (2019).
Up First: The Essential Morning News Podcast From NPR, NPR (2017): https://www.npr.org/about-npr/522211062/up-first-the-essential-morning-news-podcast-from-npr, accessed November 30, 2018.
Up First, NPR (2018): https://www.npr.org/podcasts/510318/up-first, accessed November 30, 2018.
Wang, J. H. The Perils of Ladycasting: Podcasting, Gender, and Alternative Production Cultures. In J.W. Morris and E. Hoyt (eds), Saving New Sounds: Dispatches from the PodcastRE Project, University of Michigan Press, Ann Arbor (forthcoming 2020).
We Edit Podcasts, What Is the Optimal Length for a Podcast? We Edit Podcasts (2016): https://www.weeditpodcasts.com/what-is-the-optimal-length-for-a-podcast/n, accessed November 30, 2018.
Winer, D. Payloads for RSS. (2001): https://web.archive.org/web/20080214205403/http://www.thetwowayweb.com/payloadsforrss, accessed February 23, 2019.
Winer, D. RSS 2.0 Specification, RSS 2.0 at Harvard Law (2015): https://cyber.harvard.edu/rss/rss.html, accessed February 23, 2019.