Volume 11 Number 2

# Exploratory Search Through Visual Analysis of Topic Models

## Abstract

This paper addresses exploratory search in large collections of historical texts. By way of example, we apply our method to a collection of documents comprising dossiers of the former East German Ministry for State Security and classical texts. Our approach is based on topic models, a class of algorithms that define and infer themes pervading the corpus as probability distributions over the vocabulary. Our topic-centered visual metaphor supports exploring the corpus with an intuitive methodology: first, determine a topic of interest; second, suggest documents that contain the topic in "sufficient" proportion; and third, browse iteratively through related topics and documents. Our main focus lies on providing a suitable bird's eye view of the data to facilitate an in-depth analysis in terms of the topics contained.

# Introduction

*exploration* of the corpus is not possible. Our approach is a structured one. We provide the user with a bird's eye view of the data; she then identifies topics of interest and finds the documents related to them. These documents may also be related to other topics, a connection that can reveal new and interesting insights. We are also able to identify the different contexts in which specific terms appear, i.e. to resolve semantic ambiguities. Especially when working with historical texts, this might help to reveal new aspects of known concepts.

*visually*, although this task has recently received growing attention (see section 3).

# Topic Models

- for all topics $k = 1,\dots,K$, draw topic $\beta_k \sim \mathrm{Dir}_V(\eta)$
- for all documents $d = 1,\dots,D$
  - draw document $d$'s topic proportion $\theta_d \sim \mathrm{Dir}_K(\alpha)$
  - for all words $n = 1,\dots,N_d$ in the document
    - draw the topic assignment $z_{dn} \sim \mathrm{Mult}(\theta_d)$
    - draw the word $w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})$
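The generative process listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`sample_dirichlet`, `generate_corpus`), the symmetric hyperparameters and the toy sizes are hypothetical, and the Dirichlet draw is obtained by normalizing Gamma draws from the standard library.

```python
import random

def sample_dirichlet(dim, concentration, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(K, V, doc_lengths, eta, alpha, seed=0):
    """Run the generative process forward; words are integer vocabulary ids."""
    rng = random.Random(seed)
    # One distribution over the V vocabulary entries per topic.
    beta = [sample_dirichlet(V, eta, rng) for _ in range(K)]
    theta, docs = [], []
    for N_d in doc_lengths:
        # Topic proportions of this document.
        theta_d = sample_dirichlet(K, alpha, rng)
        words = []
        for _ in range(N_d):
            z = rng.choices(range(K), weights=theta_d)[0]   # topic assignment
            w = rng.choices(range(V), weights=beta[z])[0]   # word from that topic
            words.append(w)
        theta.append(theta_d)
        docs.append(words)
    return beta, theta, docs

# Toy sizes chosen only for illustration.
beta, theta, docs = generate_corpus(K=3, V=10, doc_lengths=[5, 8], eta=0.5, alpha=0.5)
print(docs)
```

Inference runs in the opposite direction: given only `docs`, it estimates `beta` and `theta`.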

Here, $K$ is the number of topics, $N_d$ are the document lengths, $\mathrm{Dir}(\cdot)$ and $\mathrm{Mult}(\cdot)$ denote the Dirichlet and multinomial distributions, respectively (see [Kotz 2000]; [Johnson 1997]), and $\eta$ and $\alpha$ are so-called hyperparameters (i.e. model parameters) of the Dirichlet distributions. A topic $\beta_k$ is defined as a probability distribution over the vocabulary, i.e. a point on the word simplex: in every topic each word has a certain probability, and the probabilities within an individual topic sum to 1. The set of words with the highest probability is assumed to differ across topics and to describe the individual topics thematically. Moreover, we assume that only a limited fraction of terms exhibits high probability in each topic. We can ensure this by appropriately setting the topic hyperparameter $\eta$. The document topic proportions $\theta_d$ are again probability distributions, this time defined over the topics (points on the topic simplex), i.e. every topic gets some probability in a document. Each document has its own topic proportions (hence the subscript $d$), and the probabilities of the topics for a single document also sum to 1. Again, we assume that only a small number of topics is active in each document and set the hyperparameter $\alpha$ accordingly. The words $w_{dn}$ that we see in a document are generated by first drawing a topic $z_{dn}$ from the document's topic proportions $\theta_d$ and then drawing a word from the chosen topic $\beta_{z_{dn}}$. Both choices are random draws from their respective multinomial distributions. During inference, we seek to reverse this generative process in order to obtain approximations of the governing latent factors that best give rise to the observed words, i.e. we want to find the setting of the latent factors under which the observed words are highly likely. We end up with a suitable approximation of these factors that describes the generation of the words, assuming our generative model were true. Note that we have skipped the technical details of how this approximation is achieved; the interested reader is referred to [Blei 2003] or [Heinrich 2005] for a more thorough technical description.

$\theta_d$s) and a topic-term matrix (formed by the $\beta_k$s). Every model that produces this output (or whose output can be transformed into these structures) is amenable to our visualization. This includes all of the above models (although in the Author-Topic model, authors would conceptually replace documents). In fact, we could also visualize the outcome of the LSA model, which follows a completely different approach from topic modeling.

# Related Work

*and* to keep track of the others. Numerous solutions extending this concept exist, e.g. [Snyder 2013] or [Hinneburg 2012], enriching or refining the resulting presentation with different kinds of metadata. [Cao 2010] propose FaceAtlas, a visualization technique for entities extracted from texts: a graph-based network visualization augmented with density maps to visually analyze text corpora whose documents are related through different facets. This approach is similar to ours in that semantic similarity of entities determines spatial distance in the visualization. However, the way we arrive at our data model differs considerably. They use Named Entity Recognition (NER) to extract named entities from texts and visualize their relations, whereas we extract latent structures (the topics) from the text that define distributions over the vocabulary. Relations between them are implicitly defined by measuring the similarity of those distributions with suitable metrics (see e.g. [AlSumait 2009]; [Niekler 2012]). TopicNets [Gretarsson 2012] is a graph-based, interactive analysis tool that incorporates topic models into the mechanics of graph visualization and facilitates the collapsing of nodes based on semantic association, topic-based deformation of node sets, and real-time topic modeling on graph subsets at various levels. Topic Islands [Miller 1998] uses stereoscopic depictions of topics, using wavelets to describe thematic characteristics. ThemeScapes [Wise 1995] uses a terrain-like landscape metaphor to illustrate topics as hills with documents on top. Less complex linguistic approaches translate documents into high-dimensional feature vectors using the vector space model [Salton 1988] in combination with, e.g., tf-idf term weighting [Sparck Jones 1972]. In this space, words serve as dimensions and documents are represented as a point cloud, with (sub-)clusters of documents for each (sub-)topic.
Finding and visualizing this high-dimensional structure is a research field of its own. Established approaches include projective techniques like the Text Map Explorer [Paulovich 2006], Multidimensional Scaling (MDS) [Kruskal 2009], e.g. Sammon's mapping [Sammon 1969]; neuro-computational algorithms like Kohonen's Self-Organizing Maps (SOM) [Kohonen 2001]; scatterplot matrices [Elmqvist 2008]; and topological analysis based on density functions [Oesterling 2010]. However, compared to the modeling approach used in this paper, the insights obtained from the vector space model are rather limited. Our work differs in that we focus on the visual representation of topic model elements to provide more thorough and quicker visual access to the data. We use a more flexible depiction of probability distributions as tag clouds and illustrate topics in an overview image to permit the identification of related topic groups or outliers. Furthermore, we extend the analysis to word-based tasks like finding polysemous and homonymic relations, we use a smooth level-of-detail instead of a fixed number of keywords, and we allow the user to quantify the relative impact of related topics or documents. We also note that there exists a wide variety of different topic models. For example, models that impose a network structure on documents during model design make it possible to interpret the links between documents on a semantic level ([Chang 2010]; [Mei 2008]). However, we want to keep our tool applicable to the widest possible range of data and thus neglect models that make use of metadata other than word frequencies. Further, we do not aim at visualizing links between documents but rather links between topics (which are given by distributional similarity, see below).

# Exploration Tasks

## Definition of Exploration Tasks

### Exploration Task 1 - Examining a Topic

${}_k = \left(\sum_d \theta_{dk} \cdot N_d\right) / \left(\sum_d N_d\right)$, where $\theta_{dk}$ is the $d$-th document's topic proportion of topic $k$, satisfying $\theta_{dk} \geq 0$, $k = 1,\dots,K$, and $\sum_k \theta_{dk} = 1$. $N_d$ is the length of document $d$. Determining the relevance of a topic's words amounts to finding a suitable sorting, because both tasks are based on the word probabilities provided by the topic model. Since we can assume that the largest part of the vocabulary does not carry topic-specific information, it is reasonable to sort the words by decreasing probability and to treat them as increasingly irrelevant for that topic. Another approach determines the words' relevance using a tf-idf [Sparck Jones 1972] flavored procedure: each topic is interpreted as a document, and word probabilities are treated as scaled document frequencies. Using a basic tf-idf scheme, we could easily identify words that are highly descriptive of their respective topics only. In the visualization, this information is used to help the user quickly identify the key words and their relative importance for a topic.
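The length-weighted average of topic proportions and the sorting of a topic's words by decreasing probability can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the function names and the toy numbers are assumptions, and the tf-idf flavored variant is not covered.

```python
def topic_significance(theta, doc_lengths):
    """Length-weighted average share of each topic in the corpus:
    sum_d theta[d][k] * N_d / sum_d N_d."""
    total_length = sum(doc_lengths)
    num_topics = len(theta[0])
    return [
        sum(theta_d[k] * n_d for theta_d, n_d in zip(theta, doc_lengths)) / total_length
        for k in range(num_topics)
    ]

def top_words(beta_k, vocabulary, n):
    """Return the n most probable (word, probability) pairs of one topic."""
    ranked = sorted(zip(vocabulary, beta_k), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy data (hypothetical): two documents, two topics.
theta = [[0.8, 0.2], [0.5, 0.5]]
doc_lengths = [100, 300]
significance = topic_significance(theta, doc_lengths)
print(significance)

# Toy topic over a three-word vocabulary.
print(top_words([0.1, 0.6, 0.3], ["a", "b", "c"], 2))
```

In the toy data, the first topic receives (0.8 · 100 + 0.5 · 300) / 400 = 0.575 and the second 0.425, so the longer document dominates the result.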

### Exploration Task 2 - Overview over the Topics

### Exploration Task 3 - Finding Different Polysemous and Homonymic Semantics of Terms

### Exploration Task 4 - Identifying Documents Covering a Topic

$\mathcal{K} = \{k : k \in \{1,\dots,K\}\}$, the user may want to look at documents that cover these topics. This task is at the core of exploratory analysis. The information required for this is the set of topics of interest $k_i$ and a list of documents sorted in decreasing order by the combined impact of the topics $\mathcal{K}$ on the documents. Given $\mathcal{K}$, we can easily read off the probability of these topics in all of the documents, $\theta_{\cdot k}$, $k \in \mathcal{K}$. After sorting the documents by the product of their proportions of all topics in $\mathcal{K}$ (we use $\prod_{k \in \mathcal{K}} p(\theta_{d\hat{k}} \mid \hat{k} = k)$ to approximate the combined impact of $\mathcal{K}$ on each document $d$), we obtain a list of documents that exhibit the topics of interest with decreasing significance.
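The ranking described above can be sketched by multiplying, for every document, its proportions of the selected topics and sorting by that product. The function name `rank_documents` and the toy matrix are assumptions made for illustration; `math.prod` requires Python 3.8 or later.

```python
import math

def rank_documents(theta, topics_of_interest):
    """Sort document indices by the product of their proportions of the chosen topics.

    theta: list of per-document topic proportions (rows sum to 1).
    topics_of_interest: iterable of topic indices selected by the user.
    Returns a list of (document index, score), highest score first.
    """
    topics = list(topics_of_interest)
    scored = [
        (d, math.prod(theta_d[k] for k in topics))
        for d, theta_d in enumerate(theta)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy document-topic matrix (hypothetical): three documents, three topics.
theta = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.05, 0.05, 0.90],
]
ranking = rank_documents(theta, [0, 1])
print(ranking)
```

Because a product is used, a document only scores high if it covers all selected topics. In the toy data the second document therefore ranks ahead of the first, although the first has the larger share of topic 0.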

### Exploration Task 5 - Finding Related Topics of a Document

# Visualization Approach

## Visual Implementation of Exploration Tasks

### Visual Implementation of Exploration Task 1 - Examining a Topic

*tag clouds* [Steele 2010], a popular visual metaphor for weighted word-lists. Although tag cloud implementations can be highly sophisticated, we keep it simple and only focus on the information required for the exploration task. Taking the sorted words, we create labels with size and opacity proportional to the words' probabilities and arrange them in a spiral layout around the most significant word. That is, for each word, we start a spiral from a center point until sufficient space is found to place this word. As a consequence, small and increasingly transparent irrelevant words are positioned at the cloud's border or in the gaps between relevant words. The user can smoothly change the level-of-detail by zooming in and out to make small words appear and to adjust a word's readability (size and opacity) proportional to the zoom-factor. The minimum level-of-detail shows at least the top keyword per topic at full opacity. Furthermore, the tag cloud's extent is scaled by the topic's overall significance in the corpus and each cloud is assigned a distinguishable color to ease further analysis. Example topic clouds are shown in Figure 2.
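The spiral placement can be illustrated with a small sketch. It is not the authors' implementation: the Archimedean spiral, the crude bounding-box estimate (0.6 × font size per character), the parameter names and the toy words are all assumptions chosen to keep the example short.

```python
import math

def overlaps(a, b):
    """Axis-aligned boxes given as (center_x, center_y, width, height)."""
    return (abs(a[0] - b[0]) * 2 < a[2] + b[2]
            and abs(a[1] - b[1]) * 2 < a[3] + b[3])

def spiral_layout(words, max_font=40.0, step=0.1, spacing=1.0):
    """Place words along a spiral around the most probable word.

    words: list of (word, probability), sorted by decreasing probability.
    Returns one dict per word with its box, font size and opacity.
    """
    top_probability = words[0][1]
    placed = []
    for word, probability in words:
        weight = probability / top_probability      # 1.0 for the top word
        size = max_font * weight                    # size proportional to probability
        width, height = 0.6 * size * len(word), size
        t = 0.0
        while True:
            radius = spacing * t                    # Archimedean spiral r = spacing * t
            box = (radius * math.cos(t), radius * math.sin(t), width, height)
            if not any(overlaps(box, other["box"]) for other in placed):
                break                               # free spot found
            t += step
        placed.append({"word": word, "box": box, "size": size, "opacity": weight})
    return placed

# Toy topic (hypothetical words and probabilities).
layout = spiral_layout([("ministry", 0.30), ("security", 0.20),
                        ("report", 0.10), ("file", 0.05)])
for item in layout:
    print(item["word"], [round(v, 1) for v in item["box"]], round(item["opacity"], 2))
```

The most probable word lands on the center. Later, smaller words walk outward along the spiral until they fit, which is why they end up in gaps or at the border of the cloud.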

### Visual Implementation of Exploration Task 2 - Overview over the Topics

*topic space* on the surface of the unit d-simplex. There are plenty of algorithms to create a layout of the clouds that reflects pair-wise (dis)similarities as distances in the plane. For the sake of simplicity, we use either a force-directed approach or Sammon's mapping. Although the probability distributions of the topics are assumed to be sufficiently diverse to minimize cloud occlusions, the user can also scale all pair-wise cloud distances to disperse accumulations. In the overview, the user can quickly distinguish cloud sizes and identify related topics as nearby clouds or cloud accumulations.
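To make the layout criterion concrete, the following sketch computes pair-wise topic dissimilarities and evaluates how well a candidate 2-D layout preserves them using Sammon's stress. The choice of the Jensen-Shannon distance is an assumption (the text above does not fix a metric), and the function names and toy topics are hypothetical. The optimization that Sammon's mapping or a force-directed layout would perform is omitted.

```python
import math

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs, range [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def distance_matrix(topics):
    """Pair-wise dissimilarities between topic-word distributions."""
    return [[js_distance(p, q) for q in topics] for p in topics]

def sammon_stress(target, points):
    """Sammon's stress of a 2-D layout: 0 means all target distances are preserved."""
    total, error = 0.0, 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d_star = target[i][j]
            if d_star == 0:
                continue                  # identical topics carry no constraint here
            d = math.dist(points[i], points[j])
            total += d_star
            error += (d_star - d) ** 2 / d_star
    return error / total

# Toy topics over a three-word vocabulary (hypothetical numbers).
topics = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
target = distance_matrix(topics)
candidate = [(0.0, 0.0), (0.1, 0.0), (0.7, 0.0)]
print(sammon_stress(target, candidate))
```

A layout algorithm would move the points to lower this stress, so that the two similar topics end up close together and the third one further away.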