Daniel Schüller studied philosophy, linguistics and history at Friedrich-Wilhelms-Universität Bonn and RWTH Aachen University, focusing on philosophy of science and logic as well as on philosophy of language. In 2013 he graduated with an M.A. thesis in which he comparatively investigated the use and heuristics of fictional models in history and physics. Since 2014, he has been a research assistant and doctoral student at the Chair of Linguistics and Cognitive Semiotics and the Natural Media Lab at RWTH Aachen University. His main research interests include linguistic and semiotic theory – with special emphases on sign processes in co-speech gesture, semiotics in and of gesture research, motion-capture technology, and the field of digital humanities in general.

Marwan Hassani is an assistant professor in the Architecture of Information Systems group at Eindhoven University of Technology, The Netherlands. Previously, he worked as a postdoctoral researcher and associate teaching assistant in the Data Management and Data Exploration group at RWTH Aachen University, Germany. His research interests include stream data mining, sequential pattern mining of multiple streams, stream process mining, efficient anytime clustering of big data streams and exploration of evolving graph data. Marwan received his PhD (2015) from RWTH Aachen University and an equivalency Master's degree in Computer Science from the same university (2009). He has coauthored more than 42 scientific publications and serves on several program committees.

Jennifer Hinnell is a doctoral candidate in the Department of Linguistics at the University of Alberta, Edmonton, Canada. Her research centers on communication in interaction. She uses multimodal corpus data, 3D motion capture data, and experimental methods to explore how people use their bodies, in conjunction with semantic and syntactic structures in speech, to create and convey meaning. Jennifer enjoys fruitful research partnerships with the Little Red Hen Distributed Learning Lab (UCLA) and the Natural Media Lab at RWTH Aachen University, Germany.

Bela Brenger studied linguistics and computer science at RWTH Aachen University. He graduated in 2015 with an interdisciplinary thesis analyzing motion-capture data of head gestures in dialogues. Since 2016 he has been part of the scientific staff at the Chair of Linguistics and Cognitive Semiotics, where he manages the Natural Media Motion-Capture Lab. His main interests are the data-driven analysis of multimodal communication, with an emphasis on methods for integrating spatial gesture data and speech.

Irene Mittelberg is Professor of Linguistics and Cognitive Semiotics at the Institute of English, American and Romance Studies at RWTH Aachen University. She directs the Natural Media Lab at Human Technology Centre (HumTec) and the Center for Sign Language and Gesture (SignGes). After gaining an M.A. in French linguistics and art history from Hamburg University, she completed an M.A. and a Ph.D. in Linguistics and Cognitive Studies at Cornell University. Combining embodiment research with classic semiotic theories (e.g. C.S. Peirce, R. Jakobson), Mittelberg’s cross-disciplinary research on language, gesture, space, embodied cognition and the visual arts has emphasized the role of metonymy, metaphor, frames, constructions, and image schemas in multimodal communication. Moreover, Mittelberg and her research team have developed tools and methods to use optical motion-capture technology for empirical gesture research at the juncture of linguistics, semiotics, architectural design, computer science, social neuroscience, and digital humanities.


The question of how to model similarity between gestures plays an important role in current studies in the domain of human communication. Most research into recurrent patterns in co-verbal gestures – manual communicative movements emerging spontaneously during conversation – is driven by qualitative analyses relying on observational comparisons between gestures. Since these kinds of gestures are not bound to well-formedness conditions, however, such comparisons are difficult to systematize. We therefore propose a quantitative approach consisting of a distance-based similarity model for gestures recorded and represented in motion capture data streams. To this end, we model gestures by flexible feature representations, namely gesture signatures, which are then compared via signature-based distance functions such as the Earth Mover's Distance and the Signature Quadratic Form Distance. Experiments on real conversational motion capture data demonstrate the appropriateness of the proposed approaches in terms of their accuracy and efficiency. Our contribution to gesture similarity research and gesture data analysis allows for new quantitative methods of identifying patterns of gestural movements in human face-to-face interaction, i.e., in complex multimodal data sets.


Introduction

Given the central place of the body in human communication, gestures have attracted growing interest across the humanities and social sciences.

Indeed, human communication typically involves multiple modalities such as
vocalizations, spoken or signed discourse, manual gestures, eye gaze, body
posture and facial expressions. In face-to-face communication, manual gestures
play an important role by conveying meaningful information and guiding the
interlocutors’ attention to objects and persons talked about. Gestures here are
understood as spontaneously emerging, dynamic configurations and movements of
the speakers’ hands and arms that contribute to the communicative content and
partake in the interactive organization of a spoken dialogue situation.

Drawing on this large body of gesture research across various fields of the
humanities and social sciences, the interdisciplinary approach presented here
aims at identifying and visualizing patterns of gestural behavior with the help
of custom-tailored computational tools and methods. Although co-speech gestures
tend to be regarded as highly idiosyncratic with respect to their spontaneous
individual articulation by speakers in spoken dialogue situations, it is safe to
assume that there are recurring forms of dynamic hand configurations and
movement patterns which are performed by speakers sharing the same cultural
background. From this assumption follows the hypothesis that a general degree of
similarity between gestural forms may be presumed – trivially – due to the
shared morphology of the human body, and that, beyond this trivial level,
recurrent form-meaning patterns can be identified, as captured in notions such
as kinaesthemes, Kendon's locution clusters, McNeill's catchments, Ladewig's
recurrent gestures, and related concepts discussed by Cienki.
In this paper, we will focus on certain kinds of co-verbal gestures, namely
specific image-schematic gestalts such as spirals, circles, and straight paths.

Research Objective

Whereas the gesture research discussed above mostly relies on observational
methods and qualitative video analyses, our aim is to add to the catalogue of
methods for empirical linguistics and gesture studies by outlining a
computational, quantitative and comparative 3D-model-driven approach to gesture
research. While there is a trend to combine qualitative with quantitative as
well as experimental methods in multimodal communication research, our strategy
is to first define a formal, distance-based model of gesture similarity and then
to apply this methodology to the recorded 3D numerical MoCap data of a group of
participants.

Both the alignment of gestures with the co-occurring speech, and the semantic
comparison of the established (formally) sufficiently similar gesture-speech
constructions, still have to be done manually by human gesture researchers,
through semiotic analyses of the multimodal, speech and behavioral data corpora.
The primary aim of developing an automated indicator of gesture similarity is to
identify recurrent movement patterns of interest from the recorded 3D corpus
data computationally, and thus to enable human gesture researchers to handle
these data sets in a more efficient manner. In order to make gesture similarity
automatically accessible, we propose a distance-based similarity model for
gestures arising in three-dimensional motion capture data streams. In comparison
to two-dimensional video capture technology, working with numerical
three-dimensional motion capture technology has the advantage of measuring and
visualizing the temporal and spatial dynamics of otherwise invisible movement
traces with the highest possible accuracy. We aim at maintaining this accuracy
by aggregating movement traces, also called trajectories, into a gesture
signature, a flexible feature representation introduced below.

Properties of the 3D Data Model
A Vicon motion capture system was used
in this study. Participants wear a series of markers attached to predetermined
body parts of interest (fingers, wrists, elbows, neck, head, etc.). The
Vicon system automatically generates a chart of numerical
4-tuples of Euclidean space-time coordinates for each marker attached to these
points on the participants’ bodies. The movement of the markers is tracked by 14
Vicon infrared cameras, and the physical trajectories of the
markers are represented in a chart of space-time coordinates. These space-time
charts form the data sets that are investigated algorithmically, relieving the
gesture analyst of the difficult, and subjective, task of manually examining
highly ephemeral real-world dialogue situations. But what are the crucial
features that such a numerical representation must have in order to enable
researchers to not only investigate a model but also to finally derive
statements and theories about a modeled real-world situation? We address the
following research questions: Which logical features of the model are essential
if one wants to investigate the real world by investigating a model? And
secondly, what are the epistemic benefits of investigating models instead of
real-world situations?
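As an illustration, such a space-time chart can be represented as a mapping from markers to sequences of numerical 4-tuples (t, x, y, z). The following sketch uses hypothetical marker names and sample values; it is not actual Vicon output.

```python
# Sketch: a space-time chart as a mapping from markers to sequences of
# numerical 4-tuples (t, x, y, z). Marker names and values are hypothetical
# illustrations, not actual Vicon output.
from typing import NamedTuple

class Sample(NamedTuple):
    t: float  # time in seconds
    x: float  # Euclidean coordinates (e.g. in millimetres)
    y: float
    z: float

chart = {
    "RWRIST": [Sample(0.00, 12.0, 3.0, 101.0),   # sampled every 10 ms
               Sample(0.01, 12.4, 3.1, 101.2)],
    "RINDEX": [Sample(0.00, 15.2, 7.9, 110.5),
               Sample(0.01, 15.6, 8.0, 110.9)],
}

# The physical trajectory of a marker is the sequence of its positions.
trajectory = [(s.x, s.y, s.z) for s in chart["RWRIST"]]
assert trajectory[0] == (12.0, 3.0, 101.0)
```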

From a philosophy of science point of view, before computational algorithms can be applied to naturalistic real-world gestures, there must be a translation from the real-world dialogue situations – involving people speaking and gesturing – from which data are captured, into a computable set of data. For this purpose, a marker-based motion-capture system serves as the measuring instrument that effects this translation.

The most important feature is that the model preserves the identity of the
recorded movement. This preservation rests on a transitive relation, in
that two entities have an identical relation to a third entity – a frame of
reference.

Definition: Transitivity

For a binary relation R on a set A, transitivity is defined as:

∀x,y,z ∈ A: xRy & yRz → xRz

The transitivity of this relation is what allows statements about the model to
be carried over to the modeled movement.

Regarding the above-mentioned definition of transitivity, let our variables take the following values:

x = movement of body part from position a to b;

y = movement of marker M from position a to b;

z = trajectory of marker M

Given these values, we outline the transitivity relation as follows:

∀x,y,z ∈ A: xRy & yRz → xRz: x [movement of body part from position a
to b] R y [movement of marker M from position a to b]
& y [movement of marker M from position a to b] R z
[trajectory of marker M] → x [movement of body part from position a to b]
R z [trajectory of marker M].
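To make the definition concrete, a finite relation can be checked for transitivity mechanically. The following sketch encodes the three entities from the instantiation above; the relation R itself is an illustrative stand-in, not part of the original argument.

```python
# Sketch: mechanically checking the transitivity of a finite binary relation,
# mirroring the definition ∀x,y,z ∈ A: xRy & yRz → xRz.
# The entities and the relation R are illustrative.

def is_transitive(A, R):
    """R is a set of ordered pairs over the set A."""
    assert all(x in A and y in A for (x, y) in R)
    return all((x, z) in R
               for (x, y1) in R
               for (y2, z) in R
               if y1 == y2)

x = "movement of body part from a to b"
y = "movement of marker M from a to b"
z = "trajectory of marker M"
A = {x, y, z}

# For R to be transitive, xRy and yRz must entail xRz.
R = {(x, y), (y, z), (x, z)}
assert is_transitive(A, R)
assert not is_transitive(A, {(x, y), (y, z)})  # missing (x, z)
```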

This means that if the movement of a body part is related to the movement of its
marker, and the movement of the marker is related to its recorded trajectory,
then statements about the trajectory carry over to the movement of the body part
itself. A conceptual difficulty remains: a trajectory, being a chart of discrete
space-time coordinates, arguably fails to be an event in the strict sense. If,
however, we treat events and aggregated states of affairs as being synonymous,
the problem completely disappears. Otherwise, we have to re-translate the
problematic concept into one which suits our needs. In terms of epistemic
benefits, one major advantage of
the proposed distance-based gesture-similarity model (see the following
section), i.e. the combination of gesture signatures with signature-based
distance functions, is its applicability to any type of gestural pattern and to
data sets of any size. In fact, distance-based similarity models can be utilized
in order to model similarity between gestural patterns whose movement types are
well known and between gestural patterns whose inherent structures are
completely unknown. In this way, they provide an unsupervised way of modeling
gesture similarity. This flexibility is attributable to the fact that the
proposed approaches are model independent, i.e. no complex gesture model has to
be learned in a comprehensive training phase prior to indexing and query
processing. Another advantage of the proposed distance-based gesture-similarity
model is the possibility of efficient query processing. Although calculating the
distance between two gesture signatures is a computationally expensive task,
which results in at least a quadratic computation time complexity with respect
to the number of relevant trajectories, many approaches, such as the independent
minimization lower bound of the Earth Mover's Distance on feature signatures,
have been developed to accelerate query processing.

Modeling Gesture Similarity

In this section, we present a distance-based similarity model for the comparison
of gestures within three-dimensional motion capture data streams. To this end,
we first introduce gesture signatures, then discuss trajectory distance
functions, and finally lift these to gesture signature distance functions.

Gesture Signatures

Motion capture data streams can be thought of as sequences of points in a three-dimensional Euclidean space. In the scope of this work, these points arise from several reflective markers which are attached to the body and in particular to the hands of a participant. The motion of the markers is triangulated via multiple cameras and finally recorded every 10 milliseconds. In this way, each marker defines a finite trajectory of points in a three-dimensional space. The formal definition of a trajectory is given below.

Definition: Trajectory

Given a three-dimensional feature space R3, a trajectory t:{1,…,n}→R3 is defined
for all 1≤i≤n as:

t(i) = (xi,yi,zi) ∈ R3

A trajectory describes the motion of a single marker in a three-dimensional
space. It is worth noting that the time information is abstracted to
integral numbers in order to model trajectories arising from different time
intervals. Since a gesture typically arises from multiple markers within a
certain period of time, we aggregate several trajectories including their
individual relevance by means of a gesture signature. For this purpose, we
denote the set of all finite trajectories as trajectory space T=∪k∈N{t| t:{1,…,k}→ R3}
, which is time-invariant, and define a gesture signature as a function from
the trajectory space T into the real numbers R. The formal definition of a
gesture signature is given below.

Definition: Gesture Signature

Let T be a trajectory space. A gesture signature S over T is defined as:

S:T→R subject to |S-1(R∖{0})|<∞

A gesture signature formalizes a gesture by assigning a finite number of
trajectories non-zero weights reflecting their importance. Negative weights
are immaterial in practice but ensure that the gesture space
S = {S:T→R | |S-1(R∖{0})|<∞} forms a vector space. While a
weight of zero indicates insignificance of a trajectory, a positive weight
is utilized to indicate contribution to the corresponding gesture. In this
way, a gesture signature allows us to focus on the trajectories arising from
those markers which actually form a gesture. For example, if a gesture is
expressed by the participant's hands, only the corresponding hand markers
and thus trajectories have to be weighted positively.
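As a sketch, a gesture signature can be represented as a finite mapping from trajectories to their non-zero weights, with all other trajectories implicitly mapped to zero. The trajectories and weights below are illustrative.

```python
# Sketch: a gesture signature as a finite mapping from trajectories to
# non-zero weights; every trajectory not listed implicitly receives weight 0.
# Trajectories and weights below are illustrative.

# A trajectory: a tuple of (x, y, z) points indexed 1..n.
traj_wrist = ((0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.1, 0.0))
traj_index = ((0.0, 0.1, 0.0), (0.0, 0.2, 0.1))

# Only the relevant (hand) trajectories are weighted positively.
signature = {traj_wrist: 1.0, traj_index: 0.5}

def weight(S, t):
    """S(t): zero for every trajectory outside the finite support of S."""
    return S.get(t, 0.0)

assert weight(signature, traj_wrist) == 1.0
assert weight(signature, ((9.0, 9.0, 9.0),)) == 0.0
```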

A gesture signature defines a generic mathematical model but omits a
concrete functional implementation. In fact, given a subset of relevant
trajectories 𝒯+⊂T, the most naive way of defining a gesture signature is to
assign a uniform weight, for instance a weight of one, to each relevant
trajectory.

The isotropic behavior of this approach, however, completely ignores the
inherent characteristics of the relevant trajectories. We therefore weight
each relevant trajectory according to its inherent properties of motion
distance and motion variance, which are defined below.

Definition: Motion Distance and Motion Variance

Let T be a trajectory space and t:{1,…,n}→R3 be a trajectory. The motion
distance mδ:T→R of trajectory t is defined as the accumulated length of its
movement:

mδ(t) = ∑i=1…n-1 ‖t(i+1) − t(i)‖

The motion variance mσ2:T→R of trajectory t is defined, with mean position
t̄ = 1/n · ∑i=1…n t(i), as:

mσ2(t) = 1/n · ∑i=1…n ‖t(i) − t̄‖2

The intuition behind motion distance and motion variance is to take into
account the overall movement and vividness of a trajectory. The higher these
qualities, the more information the trajectory may contain and vice versa.
Their utilization with respect to a set of relevant trajectories finally
leads to the definitions of a

Definition: Motion Distance Gesture Signature and
Motion Variance Gesture Signature

Let T be a trajectory space and 𝒯+⊂T be a
subset of relevant trajectories. A motion distance gesture signature Smδ∈S is
defined for all t∈T as Smδ(t) = mδ(t) if t∈𝒯+ and Smδ(t) = 0 otherwise.

A motion variance gesture signature Smσ2∈S is defined for all t∈T as
Smσ2(t) = mσ2(t) if t∈𝒯+ and Smσ2(t) = 0 otherwise.

Motion distance and motion variance gesture signatures are able to reflect the characteristics of the expressed gestures with respect to the corresponding relevant trajectories by adapting the number and weighting of relevant trajectories. As a consequence, the computation of a (dis)similarity value between gesture signatures is frequently based on the (dis)similarity values among the involved trajectories in the trajectory space. We thus outline applicable trajectory distance functions in the following section.
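The following sketch implements one plausible reading of the two weighting schemes, assuming motion distance to be the accumulated Euclidean path length and motion variance to be the mean squared deviation from the trajectory's mean position; the exact formulas may differ in detail from the original definitions.

```python
import math

def motion_distance(t):
    """Accumulated Euclidean length of the movement (one plausible reading
    of 'overall movement'; the exact formula is an assumption here)."""
    return sum(math.dist(p, q) for p, q in zip(t, t[1:]))

def motion_variance(t):
    """Mean squared deviation from the mean position ('vividness')."""
    mean = tuple(sum(c) / len(t) for c in zip(*t))
    return sum(math.dist(p, mean) ** 2 for p in t) / len(t)

def motion_distance_signature(relevant):
    """Weight each relevant trajectory by its motion distance;
    all other trajectories implicitly receive weight zero."""
    return {t: motion_distance(t) for t in relevant}

traj = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
sig = motion_distance_signature([traj])
assert abs(sig[traj] - 2.0) < 1e-9  # two unit-length steps
```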

Trajectory Distance Functions

Due to the nature of trajectories whose inherent properties are rarely
expressible in a single figure, trajectories are frequently compared by
aligning their coincident similar points with each other. A prominent
example is the Dynamic Time Warping Distance, which is defined below.

Definition: Dynamic Time Warping Distance

Let tn:{1,…,n}→R3 and tm:{1,…,m}→R3 be two trajectories from T and
d:R3×R3→R be a distance function. The Dynamic Time Warping Distance DTW:T×T→R
between tn and tm is recursively defined as:

DTW(tn,tm) = d(tn(n),tm(m)) + min{DTW(tn|n-1,tm|m-1), DTW(tn|n-1,tm), DTW(tn,tm|m-1)}

with the base cases DTW(t0,t0) = 0 and DTW(tn,t0) = DTW(t0,tm) = ∞, where t|k
denotes the restriction of a trajectory t to its first k points and t0 denotes
the empty trajectory.

As can be seen in the definition above, the Dynamic Time Warping Distance is
defined recursively by minimizing the accumulated distances between aligned
points of the two trajectories; in practice it is computed via dynamic
programming rather than naive recursion.
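A minimal memoized sketch of the Dynamic Time Warping Distance, using the Euclidean distance as ground distance:

```python
import math
from functools import lru_cache

def dtw(a, b):
    """Dynamic Time Warping Distance between two point sequences,
    using the Euclidean distance as ground distance."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if i == 0 and j == 0:
            return 0.0
        if i == 0 or j == 0:
            return math.inf
        return math.dist(a[i - 1], b[j - 1]) + min(
            rec(i - 1, j - 1),  # advance both trajectories
            rec(i - 1, j),      # stretch b's current point
            rec(i, j - 1))      # stretch a's current point
    return rec(len(a), len(b))

# Identical shapes sampled at different rates still align with zero cost.
s1 = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0))
s2 = ((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0))
assert dtw(s1, s1) == 0.0
assert dtw(s1, s2) == 0.0
```

The memoization turns the exponential recursion of the definition into the usual O(n·m) dynamic program.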

Although there exist further approaches for the comparison of trajectories, such
as edit-distance-based and shape-based measures, we restrict ourselves to the
Dynamic Time Warping Distance in the scope of this work.

Given a ground distance in the trajectory space T, we will show in the
following section how to lift this ground distance to the gesture space
S⊂RT in order to compare gesture signatures
with each other.

Gesture Signature Distance Functions

Gesture signatures can differ in size and length, i.e., in the number of
relevant trajectories and in the lengths of those trajectories. In order to
quantify the distance between differently structured gesture signatures, we
apply signature-based distance functions, namely the Earth Mover's Distance and
the Signature Quadratic Form Distance.

The Earth Mover's Distance, whose name was inspired by Stolfi's vivid
description of the transportation problem – finding the minimal cost to move a
total amount of earth from earth hills into holes – quantifies the dissimilarity
between two gesture signatures as the minimal cost of matching their weighted
trajectories. Its formal definition is given below.

Definition: Earth Mover’s Distance

Let S1,S2∈S be two gesture signatures and δ:T×T→R be a ground distance on the
trajectory space. The Earth Mover's Distance EMDδ:S×S→R between S1 and S2 is
defined as a minimum cost flow over all possible flows f:T×T→R:

EMDδ(S1,S2) = min f ∑ti,tj∈T f(ti,tj)·δ(ti,tj)

subject to the constraints:

f(ti,tj) ≥ 0, ∑tj∈T f(ti,tj) ≤ S1(ti), ∑ti∈T f(ti,tj) ≤ S2(tj), and
∑ti,tj∈T f(ti,tj) = min{∑ti∈T S1(ti), ∑tj∈T S2(tj)}

As can be seen in the definition above, the Earth Mover's Distance between
two gesture signatures is defined as a linear optimization problem subject
to non-negative flows which do not exceed the corresponding limitations
given by the weights of the trajectories of both gesture signatures. The
computation of the Earth Mover's Distance can be restricted to the relevant
trajectories of both gesture signatures and follows a specific variant of
the simplex algorithm.
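In general, the Earth Mover's Distance is solved as a linear program. For the special case of two signatures holding the same number of unit-weight trajectories, it reduces to an optimal assignment, which the following sketch brute-forces purely for illustration; the trajectories are collapsed to single scalar features for brevity.

```python
from itertools import permutations

def emd_uniform(sig1, sig2, ground_dist):
    """Earth Mover's Distance for the special case of two signatures with
    the same number of unit-weight trajectories: the linear program then
    reduces to an optimal assignment, brute-forced here for illustration."""
    assert len(sig1) == len(sig2)
    return min(
        sum(ground_dist(t1, t2) for t1, t2 in zip(sig1, perm))
        for perm in permutations(sig2))

def d1(t1, t2):
    """Toy ground distance: trajectories collapsed to one scalar feature."""
    return abs(t1[0] - t2[0])

# Swapped trajectories incur no cost; shifted ones do.
assert emd_uniform([(0.0,), (2.0,)], [(2.0,), (0.0,)], d1) == 0.0
assert emd_uniform([(0.0,), (1.0,)], [(2.0,), (3.0,)], d1) == 4.0
```

A production implementation would use the simplex-based transportation solver mentioned above rather than enumerating permutations.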

The idea of the Signature Quadratic Form Distance consists in adapting the
generic concept of the Quadratic Form Distance to gesture signatures. To this
end, a similarity function s:T×T→R on the trajectory space is used to correlate
the weights of the trajectories. The similarity correlation ⟨S1,S2⟩s of two
gesture signatures S1,S2∈S is then defined as:

⟨S1,S2⟩s = ∑ti,tj∈T S1(ti)·S2(tj)·s(ti,tj)

The similarity correlation between two gesture signatures finally leads to the definition of the Signature Quadratic Form Distance, as shown below.

Definition: Signature Quadratic Form Distance

Let S1,S2∈S be two gesture signatures and s:T×T→R be a similarity function. The
Signature Quadratic Form Distance SQFDs:S×S→R between S1 and S2 is defined as:

SQFDs(S1,S2) = √(⟨S1,S1⟩s + ⟨S2,S2⟩s − 2·⟨S1,S2⟩s)

The Signature Quadratic Form Distance is defined by adding the
intra-similarity correlations ⟨S1,S1⟩s and ⟨S2,S2⟩s of the gesture signatures S1
and S2 and subtracting twice their inter-similarity correlation ⟨S1,S2⟩s. The
smaller the difference between the intra-similarity and inter-similarity
correlations, the lower the resulting Signature Quadratic Form Distance, and
vice versa. The computation of the Signature Quadratic Form Distance can be
restricted to the relevant trajectories of both gesture signatures and has a
quadratic computation time complexity with respect to the number of relevant
trajectories.
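A direct sketch of the Signature Quadratic Form Distance over signatures represented as weight dictionaries; the similarity function shown, s = 1/(1+d), is one simple way to derive a similarity from a distance and is an assumption here, as is the collapsing of trajectories to scalars.

```python
import math

def sqfd(S1, S2, sim):
    """Signature Quadratic Form Distance between two gesture signatures
    (dicts mapping trajectories to weights), given a similarity function."""
    def corr(A, B):  # similarity correlation <A,B>_s
        return sum(wa * wb * sim(ta, tb)
                   for ta, wa in A.items() for tb, wb in B.items())
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(corr(S1, S1) + corr(S2, S2) - 2 * corr(S1, S2), 0.0))

def sim(t1, t2):
    """Toy similarity derived from a distance via the kernel s = 1/(1 + d);
    trajectories are collapsed to one scalar feature for brevity."""
    return 1.0 / (1.0 + abs(t1[0] - t2[0]))

S1 = {(0.0,): 1.0, (1.0,): 1.0}
assert sqfd(S1, dict(S1), sim) == 0.0       # identical signatures
assert sqfd(S1, {(5.0,): 2.0}, sim) > 0.0   # clearly different signatures
```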

More details regarding the Earth Mover's Distance and the Signature
Quadratic Form Distance as well as possible similarity functions can be
found for instance in the PhD thesis of Beecks.

Experimental Evaluation

Evaluating the performance of distance-based similarity models is a largely
empirical discipline: it is hard to foresee which approach will provide the best
retrieval performance in terms of accuracy. To this end, we qualitatively
evaluated the proposed distance-based approaches to gesture similarity by using
a natural media corpus of motion capture data collected for this project. This
dataset comprises three-dimensional motion capture data streams arising from
eight participants during a guided conversation. The participants were equipped
with a multitude of reflective markers which were attached to the body and in
particular to the hands. The motion of the markers was tracked optically via
cameras at a frequency of 100 Hz. In the scope of this work, we used the right
wrist marker as well as two markers each on the right thumb and the right index
finger. The gestures arising within the conversation were classified by domain
experts according to the following types of movement: spiral, circle, and
straight. Example gestures of these movement types are sketched in Figure 1. A
total of 20 gesture signatures containing five trajectories each was obtained
from the motion capture data streams. The trajectories of the gesture signatures
have been normalized to the unit cube [0,1]3 ⊂ R3 in order to achieve translation invariance.
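A simple per-axis min-max normalization into the unit cube, shown here as one way such a normalization could be realized (the paper's exact scheme is not spelled out):

```python
def normalize(traj):
    """Scale and translate a trajectory into the unit cube [0,1]^3 by
    min-max normalization per axis (degenerate axes are left at zero)."""
    mins = [min(p[k] for p in traj) for k in range(3)]
    maxs = [max(p[k] for p in traj) for k in range(3)]
    spans = [mx - mn or 1.0 for mn, mx in zip(mins, maxs)]  # avoid /0
    return [tuple((p[k] - mins[k]) / spans[k] for k in range(3))
            for p in traj]

traj = [(10.0, 5.0, 0.0), (12.0, 7.0, 0.0), (14.0, 9.0, 0.0)]
norm = normalize(traj)
assert norm[0] == (0.0, 0.0, 0.0)
assert norm[-1] == (1.0, 1.0, 0.0)
```

Because every trajectory is mapped into the same cube regardless of where it occurred in capture space, the subsequent distance computations become translation-invariant.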

The resulting distance matrices between all gesture signatures with respect to
the Earth Mover's Distance and the Signature Quadratic Form Distance are shown
in Figure 2 and Figure 3, respectively. As described in the previous Section, we
utilized the Dynamic Time Warping Distance based on Euclidean Distance as
trajectory distance for the Earth Mover's Distance and converted this trajectory
distance by means of the power kernel into a similarity function for the
Signature Quadratic Form Distance.

As can be seen in Figure 2 and Figure 3, both the Earth Mover's Distance and the Signature Quadratic Form Distance show the same tendency in terms of gestural dissimilarity. Although the distance values computed through the two distance functions have different orders of magnitude, both gesture signature distance functions are generally able to distinguish gesture signatures from different movement types. On average, gesture signatures belonging to the same movement type are less dissimilar to each other than gesture signatures from different movement types. We further observed that the distinction between gesture signatures from the movement types spiral and straight is most challenging. This is caused by the similar movement sequences of these two gestural types. While gesture signatures belonging to the movement type straight follow a certain direction, e.g., movement along the horizontal axis, gesture signatures from the movement type spiral additionally oscillate with respect to a certain direction. Since this oscillation can be dominated by the movement direction, the underlying trajectory distance functions are often unable to distinguish oscillating from non-oscillating trajectories, and thus gesture signatures of movement type spiral from those of movement type straight.

Apart from the quality of accuracy, efficiency is another important aspect when
evaluating the performance of gesture similarity models. For this purpose, we
measured the computation times needed to perform single distance computations on
a single-core 3.4 GHz machine. We implemented the proposed distance-based
approaches in Java 1.7. The Earth Mover's Distance, which needs on average 148.6
milliseconds for a single distance computation, is approximately three times
faster than the Signature Quadratic Form Distance, which needs on average 479.8
milliseconds for a single distance computation. In spite of the theoretically
exponential and empirically super-cubic computation time complexity of the Earth
Mover's Distance, its computation times thus remain practicable for gesture
signatures of the size investigated here.

To sum up, the experimental evaluation reveals that the proposed distance-based approaches are able to model gesture similarity in a flexible and model-independent way. Without the need for a preceding training phase, the Earth Mover's Distance and the Signature Quadratic Form Distance are able to provide similarity models for searching similar gestures which are formalized through gesture signatures.

Conclusions and Future Work

In this paper, we have investigated distance-based approaches to measuring similarity between gestures arising in three-dimensional motion capture data streams. To this end, we have explicated gesture signatures as a way of aggregating the inherent characteristics of spontaneously produced co-speech gestures, and signature-based distance functions such as the Earth Mover's Distance and the Signature Quadratic Form Distance as a way of quantifying dissimilarity between gesture signatures. The experiments conducted on real data provide evidence of the proposal's appropriateness in terms of accuracy and efficiency.

In future work, we intend to extend our research on gesture similarity towards indexing and efficient query processing. While the focus of the present paper lies on dissimilarity between pairs of gestures, we further plan to quantitatively analyze motion capture data streams in a query-driven way in order to support the domain experts' qualitative analyses of gestural patterns within multi-media contexts. The overall goal of this research is to contribute to the advancement of automated methods of pattern recognition in gesture research by enhancing qualitative analyses of complex multimodal data in the humanities and social sciences. While this paper focuses on formal features of the gestural movements, further steps will entail examining the semantic and pragmatic dimensions of these patterns in light of the cultural contexts and embodied semiotic practices they emerge from.

Acknowledgment

This work is partially funded by the Excellence Initiative of the German federal
and state governments and DFG grant SE 1039/7-1. This work extends