Patrick Jähnichen is a postdoctoral researcher at the machine learning group at Humboldt-Universität zu Berlin. He received his PhD from Leipzig University in 2016 for his dissertation on modeling topics dynamically over time. He graduated with an MSc degree in computer science from Leipzig University after having received a bachelor’s degree from the University of Cooperative Education in Stuttgart. His main research interests are Bayesian mixture models and their dynamics applied to natural language texts, stochastic processes to steer the dynamics, and statistical inference in these models. At the time of writing, the author was affiliated with the Natural Language Processing Group, Leipzig University, Germany.
Patrick Oesterling received his Master's degree in Computer Science in 2009 from the University of Leipzig, Germany. In 2016 he received his PhD from the Department of Computer Science at the University of Leipzig, where his research focused on computer graphics, information visualization and visual analytics.
Gerhard Heyer studied at Cambridge University and Ruhr University, where he received his Ph.D. After research on AI-based natural language processing at the University of Michigan, he worked in industry for several years. He holds the chair of Natural Language Processing at the computer science department of the University of Leipzig. His research focuses on automatic semantic processing of natural language text, with applications in information retrieval and search as well as knowledge management. He is a member of the IEEE Computer Society.
Tom Liebmann received his Master's degree in Computer Science in 2014 from Leipzig University, Germany. He is currently a scientific employee at the same institution, with a research focus on the analysis and visualization of the topology of uncertain scalar fields.
Gerik Scheuermann received the master's degree in mathematics in 1995 and the PhD degree in Computer Science in 1999, both from the Technical University of Kaiserslautern. He has been a full professor at the University of Leipzig since 2004. He has co-authored more than 120 reviewed book chapters, journal, and conference papers. His current research interests focus on visualization, with an emphasis on topology-based methods, flow visualization, visualization for life sciences, and visualization of text collections. He has served as paper co-chair for EuroVis 2008, IEEE Visualization 2011, and IEEE Visualization 2012, and as general chair of EuroVis 2013.
Christoph Kuras received his M.Sc. degree in Business Information Systems from the University of Leipzig in 2013. Currently, he is a researcher in the Natural Language Processing group at the Computer Science Department of the University of Leipzig. He is part of the team at the CLARIN-D centre Leipzig and also engaged in the text corpus creation and archiving processes for the Leipzig Corpora Collection (LCC). His research focuses on the application of business process management in NLP-based research environments.
This paper addresses exploratory search in large collections of historical texts. By way of example, we apply our method to a collection of documents comprising dossiers of the former East-German Ministry for State Security, and to classical texts. The basis of our approach are topic models, a class of algorithms that define and infer themes pervading the corpus as probability distributions over the vocabulary. Our topic-centered visual metaphor supports exploring the corpus following an intuitive methodology: first, determine a topic of interest; second, suggest documents that contain the topic with "sufficient" proportion; and third, browse iteratively through related topics and documents. Our main focus lies on providing a suitable bird's eye view onto the data to facilitate an in-depth analysis in terms of the topics contained.
Using topic models for exploratory search in large collections of historical texts.
When dealing with large collections of digitized historical documents, very often
only little is known about the quantity, coverage and relations of their content.
In order to get an overview, an interactive way to explore the data is needed
that goes beyond simple lookup approaches. The notion of exploratory
search describes exactly this setting: the user does not know in advance
precisely what she is looking for, and the search is an iterative process of
inspecting results, gaining insights, and refining the focus.
Topic modeling
Topic modeling research, however, often focuses on the development of the
probabilistic models themselves, e.g., incorporating a richer meta-data structure,
increasing the speed of inference, or using nonparametric models to circumvent
model selection problems. Comparatively little effort has been made to develop
methods that make the outcome of these models usable in applications.
In this paper, we present a prototypical visual analysis tool to find and display the relations suggested by topic modeling. We derive distinct exploration tasks from the elements of a topic model, and present visual implementations for these tasks to provide the user with interactive means to browse through relations between documents, topics and words. In this way, the user uncovers expected or unexpected facts that eventually lead to interesting documents. More precisely, we represent topics by tag clouds of different size and, by considering pair-wise topic similarities, we lay out these clouds in the plane to provide the user with a topic-centered view on the data. Using smooth level-of-detail transitions and by interacting with topic distribution charts, the user freely navigates through the data by concatenating single exploration tasks, following focus-and-context concepts and an intuitive methodology: overview first, details on demand.
The rest of the paper is organized as follows: in the next section we briefly
discuss the underlying method, topic modeling. We then review related work in
section 3, both from the language processing point of view and from the direction of
presenting topic models (and their alternatives) visually. In section 4, we
define elementary exploration tasks applicable to the outcome of topic models,
followed by descriptions of their visual implementation in our analysis tool in
section 5. We report results from fitting topic models to two different data
sets in section 6: Stasi records collected from the former East-German
Ministry of State Security and the ECCO-TCP corpus.
Topic models are a family of algorithms that decompose the content of large
collections of documents into a set of topics and then represent each document
as a mixture over these topics (based on the document's content). The outcome is
thus a list of words for each topic (showing the probability of a term appearing
in this topical context) and the proportion of topics for each of the documents.
The key ingredients for finding this structure are word co-occurrences: words in
a topic tend to co-occur across documents and hence are interpreted to share a
common semantic concept (following the assumptions of distributional semantics).
The most prominent such model is latent Dirichlet allocation (LDA). Its
generative process draws, for each topic k, a word distribution β_k ~ Dir(η);
for each document d, a topic distribution θ_d ~ Dir(α); and then, for each of
the N_d word positions in document d, a topic assignment z_{d,n} ~ Mult(θ_d)
and finally a word w_{d,n} ~ Mult(β_{z_{d,n}}),
where N_d are the document lengths and Dir(·) and Mult(·)
respectively denote the Dirichlet and multinomial distribution.
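As a sketch, this generative process can be simulated in a few lines of numpy; the corpus dimensions and hyperparameter values below are illustrative choices, not values used in the paper:

```python
import numpy as np

def simulate_lda(D=5, K=3, V=10, N_d=20, alpha=0.1, eta=0.01, seed=0):
    """Draw a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # One word distribution beta_k per topic, drawn from Dir(eta).
    beta = rng.dirichlet(np.full(V, eta), size=K)     # shape (K, V)
    # One topic distribution theta_d per document, drawn from Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha), size=D)  # shape (D, K)
    docs = []
    for d in range(D):
        # For each word position: pick a topic z, then a word w ~ Mult(beta_z).
        z = rng.choice(K, size=N_d, p=theta[d])
        words = [int(rng.choice(V, p=beta[k])) for k in z]
        docs.append(words)
    return beta, theta, docs

beta, theta, docs = simulate_lda()
```

Inference then inverts this process: given only the observed words, it recovers estimates of the θ_d and β_k matrices.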
Visualizing the results of this model is one solution to unveil knowledge hidden in the data. However, the outcome (i.e. the topics and documents' topic proportions) is obviously inappropriate for direct visualization. Without using thresholds, presenting entire probability distributions as sorted lists of words and values is not very handy and quickly results in information excess and cluttered visualizations. Even working with thresholds does not immediately lead to parameter settings that are independent of the input data, e.g. how many words are actually necessary to obtain a reasonably good impression of a topic found by the model. That is, depending on the semantic quality of words and topics, a flexible level-of-detail is necessary to identify meaningful information in a topic. On the other hand, the amount of information relevant for each element of the topic model is assumed to be rather small. Therefore, the visual implementation of these elements should focus on the pivotal parts of the distributions, while increasingly disregarding irrelevant parts. In the end, the relations between the input documents, the latent topics found by the model and the actual probabilities of a topic's keywords are the key elements containing interesting insights about the data.
We emphasize that LDA is just one model that subsumes document collection content
into topics. Numerous different topic models can be used alternatively; besides
plain LDA, others take additional meta-data into account, e.g., the Author-Topic
model. What all of them share is the outcome we rely on: a document-topic matrix
(formed by the θ_d's) and a topic-term matrix (formed by the β_k's).
We also note that we do not go into the analysis of the models themselves but
rather restrict our discussion to the outcome that they produce; assessing the
quality of the models' results is a research field of its own.
Traditional linguistic approaches, such as the vector space model, represent documents as high-dimensional term vectors and compare them with geometric similarity measures.
Closely related to our approach is that of
Using the aforementioned outcome of topic models, we aim to provide the user with exploratory means to analyze a corpus by creating a largely topic-centered view on the data and letting latent topics act as the user's main interface to the documents. In this section, we recall the elements of a topic model and classify them into exploration tasks to relate topics to words and documents, and vice versa. The analysis process then consists of concatenating elementary exploration tasks. That is, motivated by intermediate insights about expected or unexpected relationships, the user interactively browses through the data via linked tasks.
The probability distributions resulting from a topic model relate the whole vocabulary to latent topics, and the latter to all documents. That is, the outcome of this model is very complex in that all words occur in every topic, and all topics appear in every document, both with certain "significance" (in fact probability). Because such complex data is hard to handle as a whole, we split the analysis process into distinct exploration tasks to reveal possible relations between single conceptual entities, e.g. between documents and one or more topics, between topics related to a single document, or between topics related to single words. Based on a simple input-output scheme, every task requires certain information produced by the topic model or provided by the input data, and it discloses potentially existent relationships between them.
Examining a single topic is difficult because it is a probability
distribution over potentially thousands of words in the vocabulary.
Technically, this task involves the following information: the topic's
overall significance in the corpus, a meaningful sorting of the words
for appropriate topic description, and actual word significances to
provide pivotal keywords and their relative importance. A topic's
overall significance can easily be computed as a relative measure from
the model outcome: topic-significance_k = (∑_{d=1}^{D} θ_{d,k}) / D,
i.e., the average proportion of topic k over all D documents.
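A minimal sketch of this computation, assuming the significance of topic k is the average of the proportions θ_{d,k} over all documents (the exact normalization in the original formula is not fully recoverable from the text):

```python
import numpy as np

def topic_significance(theta):
    """Overall topic significance: average topic proportion across documents.

    theta: (D, K) document-topic matrix whose rows sum to 1.
    Returns a length-K vector that also sums to 1.
    """
    return theta.mean(axis=0)

theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.4, 0.4, 0.2]])
sig = topic_significance(theta)  # ≈ [0.4, 0.467, 0.133]
```

Because the rows of θ sum to one, the resulting significances form a distribution over topics, which directly maps to relative tag-cloud sizes.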
The second exploration task is to summarize the set of latent topics
found by the topic model. This includes the following information: the
number of topics, their overall significance for (or impact on) the
corpus, and similarities between topics defined by some measure. While
the overall significance is equal to that in Exploration Task 1, for
topic similarity, different metrics are possible; one natural family of
scores compares the topics' word distributions directly.
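As one concrete example of such a metric (the specific score the paper cites is not reproduced here), the Jensen–Shannon divergence between two topics' word distributions β_i and β_j is a common symmetric choice:

```python
import numpy as np

def jensen_shannon(p, q):
    """Symmetric divergence between two word distributions (0 = identical)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical topics have divergence 0; disjoint topics reach the maximum of 1.
d = jensen_shannon([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])  # → 0.0
```

Its square root is a proper metric, so it can directly feed a distance-based layout of the topic clouds.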
One advantage of topic models is the automatic disambiguation of semantic
meanings of words into topics, since a word with different meanings appears
in several topics. This task involves the following information: a
selected word of interest, the topics that contain this word with
sufficient probability, and the relevance of the selected
word in these topics to evaluate semantic diversity. In the
visualization, the user should be able to quickly deduce potential
semantical ambiguity by selecting a word in any topic and seeing the
impact of this word in other topics provided by the topic model.
Having identified one or more interesting topics 𝒦 = {k_1, …, k_n}, this
task involves the following information: the selected topics 𝒦 and a
list of documents sorted in decreasing order by the combined impact of
topics 𝒦 on the documents. Given 𝒦, we can easily read off the
probability of these topics in all of the documents from the proportions
θ_{d,k} with k ∈ 𝒦.
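A sketch of this ranking, using the document-topic matrix θ and a set 𝒦 of selected topic indices:

```python
import numpy as np

def rank_documents(theta, selected_topics):
    """Sort document indices by the combined impact of the selected topics.

    theta: (D, K) document-topic matrix; selected_topics: list of topic indices.
    Returns (document index, combined score) pairs in decreasing score order.
    """
    combined = theta[:, selected_topics].sum(axis=1)  # sum of theta_{d,k}, k in K
    order = np.argsort(-combined)                     # decreasing combined impact
    return [(int(d), float(combined[d])) for d in order]

theta = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.1, 0.4],
                  [0.2, 0.3, 0.5]])
ranking = rank_documents(theta, [1, 2])  # documents ordered 0, 2, 1
```

This list is exactly what the HUD's scrollable document view presents for a topic selection.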
Once an interesting document has been identified, the user may want to
inspect other topics related to it, or, in a transitive way, documents
related to these other topics. Again, this task aims at giving the user
a tool for exploring related documents (and thus the corpus) through
picking interesting topics. The following information is involved in
this task: a document of interest, the proportions of other topics in
this document, and documents related to this document. While the related topics
of a document simply result from the topic model, the latter information
can be obtained by comparing the topic distributions of two documents
with a suitable similarity metric.
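One possible metric for this comparison (the paper's cited example is not recoverable here) is the Hellinger distance between two documents' topic distributions θ_d and θ_e:

```python
import numpy as np

def hellinger(theta_d, theta_e):
    """Distance between two documents' topic distributions.

    Ranges from 0 (identical distributions) to 1 (disjoint topic support).
    """
    p, q = np.asarray(theta_d, float), np.asarray(theta_e, float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

d = hellinger([0.2, 0.8], [0.3, 0.7])  # small value: similar documents
```

Ranking all other documents by this distance to the document of interest yields the "related documents" needed for this task.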
In this section, we explain our analysis tool and provide visual implementations for each exploration task defined in section 4. Furthermore, we describe interactive means to navigate through the data by letting the user concatenate individual tasks, stimulated by the feedback of previous insights and following an intuitive analysis methodology: overview first, details on demand.
The visualization of a topic model should provide quick visual access to the key features and relations in the data. This task is difficult because the broad and complex probability distributions produced by a topic model contain large amounts of irrelevant information. That is, only a minor part of the vocabulary is meaningful to describe a topic, and only some topics have considerable impact on a certain document. Hence, visualizing the topic model as a whole rapidly creates cluttered visualizations. We pursue a visualization approach that illustrates the crucial information of individual exploration tasks, but that also allows the user to refine the level-of-detail in an intuitive and interactive way.
To visualize a single topic we make use of tag clouds, in which a topic's most probable words are displayed with label sizes proportional to their probability.
The topic overview visualization is the main view on the data and the
starting point of any other exploration task. To visualize the required
information for this task, we lay out tag clouds in the plane to present them
on the screen. Their number reflects the number of latent topics found by
the topic model, and their pair-wise distances in the layout approximate
their similarities, understood either as the difference between probability
distributions or as distances in the high-dimensional word space.
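The paper does not name its layout algorithm; classical multidimensional scaling (MDS) is one standard way to place topics in the plane so that pairwise layout distances approximate the given dissimilarities:

```python
import numpy as np

def classical_mds(dist, dim=2):
    """Embed n items in dim dimensions, approximating pairwise distances.

    dist: (n, n) symmetric matrix of pairwise topic dissimilarities.
    Returns an (n, dim) array of coordinates for the cloud layout.
    """
    n = dist.shape[0]
    # Double-center the squared distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    # The top eigenvectors of B give the low-dimensional coordinates.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Three topics: 0 and 1 are similar, topic 2 is far from both.
dist = np.array([[0.0, 0.1, 1.0],
                 [0.1, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])
pos = classical_mds(dist)
```

In the resulting layout, similar topics end up as neighboring clouds, which is what lets spatial proximity encode semantic relatedness in the overview.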
To identify polysemous terms and homonyms, the selected word of interest (selection mechanisms will be explained in section 5.2) is highlighted in every topic cloud which provides quick access to this word's significances in other topics via their labels' sizes. Moreover, clouds corresponding to topics in which the selected word is considered insignificant are decolored, i.e. bleached out to facilitate focusing only on those topics with relevant word probability. Because a word's size in another topic could be marginal relative to all other label sizes, or because the current zoom-level is not high enough to identify the word of interest in every cloud, we also provide a chart in a head-up display (HUD) to denote the proportions of this word's probability in each topic. Every part of the chart is colored according to the topic it represents. By inspecting the highlighted labels' size or their corresponding parts in the chart, the user can quickly judge the diversity and partial quantities of a word's different meanings. The topic chart is also a starting point for topic-related exploration tasks (cf. section 5.2).
Given one or more selected topics of interest, the sorted list of documents covering these topics is presented in the head-up display. Each of the scrollable list's entries shows the document's name. The list is also a starting point for document-related exploration tasks (cf. section 5.2).
Each of the document list's entries additionally shows a small chart of the document's overall topic distribution. From these charts, the user can directly read off and compare the impact of the selected topics on every document; also in contrast to all other topics. Selecting a document activates a magnified version of the topic chart to the right of the list in the HUD and serves as a source for topic-related exploration tasks, like examining its words or updating the document list.
The user browses interactively through the data by concatenating exploration tasks. The visual implementations of these tasks can be linked in order to launch subsequent tasks based on intermediate insights about the data (cf. Figure 1 for the different visualization components).
Topics can be selected in two ways: by right-clicking one or more clouds in the overview visualization, or by selecting the corresponding part of a topic chart (see the next two selection mechanisms). A selection of one or more topics triggers the following actions: an accentuation of the corresponding clouds with an additional border to highlight selected topics, and an update of the document list in the HUD to present those documents that share the selected topics, sorted by decreasing combined impact. In addition, if a topic is currently not visible in the cloud overview, a camera movement centers and magnifies it on the screen so that representative words can be read off to get a quick understanding of the topic. Note that this camera movement is actually also part of exploration task 1, because being able to read words associated with a topic is an inherent part of examining it.
As part of exploration task 3, words are selected by left-clicking them in any of the topic clouds. A word selection triggers three actions: the word's accentuation in every other cloud, the decoloring of those clouds that feature only insignificant probabilities for this word, and the creation of a pie chart in the HUD showing an aggregation of the word's significance in different topics, creating a starting point for exploration task 4 or exploration task 1.
Documents are selected by clicking on them in the HUD's document list. A document selection triggers the creation of the topic distribution chart for this document, which is placed to the right of the document and is used to trigger the topic-based exploration tasks 1 and 4. Further, topics not relevant for this document are decolored in the cloud overview to quickly identify those that are.
Using these interaction mechanisms, the analysis process is carried out by combining exploration tasks in a transitive or cyclic way. That is, the selection of topics, words, or documents highlights other visual entities and updates widgets, which triggers the next exploration task. For example, clicking on a word in one of the clouds (tasks 1 and 2) creates a topic chart (task 3) in which click-events create the document list for certain selected topics (task 4). Clicking on a document creates a chart for related topics (task 5) whose selection centers a topic cloud (task 1) and updates the document list (task 4), and so on.
Note that camera movements related to topic selections constantly preserve the visual context in the topic space. That is, for selected topics, the user can always read off keywords to evaluate their meaning and importance and related topics and their overall significance can be identified by examining nearby clouds. By bleaching out topics that do not significantly contribute to one of the topic charts, the user can quickly identify the spatial relation between selected (colored) topics in the overview. This can help to reveal interesting words appearing in deselected topics. Note that we understand our framework as a tool to navigate through the data based on relations between topics and both words and documents. Once interesting documents are identified, their content is presented to the user in a linked view or in the head-up display.
We report use cases of fitting topic models to two different data sets. The first is the ECCO-TCP corpus, a series of classical publications; the second consists of records of the former East-German Ministry for State Security, colloquially the "Stasi" (an abbreviation of STAatsSIcherheit, i.e., State Security).
Figures 2 and 3 show two topics extracted from the data sources. Words' sizes are determined by their probability in the topic's distribution over the vocabulary. The topics can easily be identified to circle around the literary genre of drama and communist-party propaganda concerning the youth, respectively. As the user zooms in, more words become visible that were hidden because of lesser relevance to the topic; uncovering them reveals a semantic refinement of the topic. This shows that the most significant terms do not define a topic alone; they merely allude to the topic's semantic meaning, which is subsequently refined by the other words carrying considerable probability mass.
Figures 4 and 5 show an overview of the topics found. As described above, the size of the tag clouds represents the overall topic significance in the whole data set, and spatial proximity indicates closer semantic relatedness of topics. One example (cf. Figure 4) is the cluster of topics that cover different aspects of religion and its role in colonization and history. To determine the differences between them, the user can follow the methodology for examining a topic. The objective of this view is to provide an initial starting point for further analysis and to draw the user's attention to interesting parts of the data. Figure 5 shows a cluster of topics that are concerned with taking appropriate measures towards different problems in the economy and society of the GDR. The user gains a first insight into the corpus and is motivated to continue her exploration of the data by concatenating further tasks.
To disambiguate the semantics of a given term, the user selects the word of
interest; other topics that exhibit the word with sufficient significance are
then highlighted. In Figure 6, the selected word in the visualization of the
ECCO-TCP corpus is "greeks". Other topics that include this term with
sufficient probability are the topics about medicinal findings about organs,
Jesus and religion, and drama. Clearly, "greeks" appears in different
semantical contexts here: it relates to the Greek political system and the
ancient Greek society in that it had a modern understanding of medicine (for
their time), and it stands for famous Greek authors (most prominently Homer).
Figure 7 shows the different semantics of the word "untersuchungen"
(investigations) from the Stasi corpus. The term is used in connection with
the ever-suspicious Stasi (topics dealing with addresses, names, Berlin,
"lives at", etc.), but also with investigations of accidents in state-owned
factories and at the state railroad.
Akin to the example in Figure 7, assume that the user is interested in documents
that include the topic described by terms like "indian", "cape", "spaniards",
"china" or "anchor". Selecting the topic creates a list of documents that
cover it (in this case travel narratives and reports from the English colonies)
in the middle of the HUD, as shown in Figure 8. The user can scroll through
this list and has thereby completed one goal of exploratory analysis: she has
identified a topic of interest, examined it to identify its semantic nature,
and found a list of documents that cover it.
Figure 9 solves this task on the Stasi data set. The selected topic is about
problems in medical care in the former GDR, indicated by German terms meaning
"medical", "care" and "problems". The resulting list includes mainly
statistical reports, but also documents about the unions' assessment of the
problem and even about the planned departure of a physician. This allows for
content-driven access to the data. It is also possible to combine different
topics so that we can display documents that cover a combination of topics
(see Figure 1 for one such example).
While exploring the data set and reading documents that cover a topic of
interest, it is often the case that this topic is not the only one covered
by a document. By displaying a chart of the proportions of other topics covered
by this document, as in Figure 10, the user is encouraged to continue her
exploration by examining these other topics. In our example, the user first
selected a topic about anatomy (with terms like "organs", "fluids",
"sensations", "uterus", etc.) and then a document covering this topic. We find
a connection to another topic about religion ("jesus", "gospel", "saviour",
etc.); indeed, the connection is confirmed by the selected document's full
title.
In this paper we have described a visual tool using a tag-cloud-based approach
to visualize the outcome of topic models. Showing series of word probabilities
as tag clouds provides quick visual access to a topic's meaningful
keywords, including their significance and qualitative difference to other words
in the topic. That is, compared to a simple list of sorted words, the user can
quickly judge topical distinctness by the ratio between words of high and low
probability, and pivotal keywords also stand out visually in the clouds.
Furthermore, by zooming in and out to change the level-of-detail, the user can
quickly adjust a topic's expressiveness in terms of its keywords, while still
minimizing unnecessary information by keeping the remaining words small and
translucent.
We understand our tool as a topic-centered navigator to visually disclose and
present structure hidden in the outcome of the topic model. That is, reading
documents and other document-related exploration tasks are currently not
considered in our tool. We leave the investigation of these tasks and their
visual implementations, and also the expansion of the visual analysis to
time-dependent data for future work. We also omitted further possible
improvements in the preprocessing of data, i.e. before learning a topic model.
These may include stemming, lemmatization, and restricting the vocabulary to
certain parts of speech. Also, as of now, we are not able to export
findings from our approach and are restricted to identifying document titles.
However, we plan to incorporate our visualization into a larger NLP toolbox
(the Leipzig Corpus Miner).