Abstract
In this paper, we present a case study on quality criteria for the robustness of categories
in pragmalinguistic tagset development. We model a number of classification tasks
for linguistic routines of discourse referencing in the plenary minutes of the German
Bundestag. In the process, we focus and reflect on three fundamental quality criteria:
1. segmentation, i.e. size of the annotated segments (e.g. words, phrases or sentences),
2. granularity, i.e. degrees of content differentiation and 3. interpretation depth,
i.e. the degree of inclusion of linguistic knowledge, co-textual knowledge and extra-linguistic,
context-sensitive knowledge. With the machine learnability of categories in mind,
our focus is on principles and conditions of category development in collaborative
annotation. Our experiments and tests on pilot corpora aim to investigate to what
extent statistical measures indicate whether interpretative classifications are machine-reproducible
and reliable. To this end, we compare gold-standard datasets annotated with different
segment sizes (phrases, sentences) and categories with different granularity, respectively.
We conduct experiments with different machine learning frameworks to automatically
predict labels from our tagset. We apply BERT ([Devlin et al. 2019]), a pre-trained neural transformer language model which we finetune and constrain
for our labelling and classification tasks, and compare it against Naive Bayes as
a probabilistic knowledge-agnostic baseline model. The results from these experiments
contribute to the development and reflection of our category systems.
1. Introduction
This study investigates discourse referencing practices in parliamentary debates from
a linguistic perspective.
Discourse referencing is present in sentences in which a speaker makes references to preceding utterances
within the discourse. We therefore study intertextual references to oral utterances
and written texts. The visualization and automated recognition specifically of such
practices opens up relevant new perspectives of insight. Firstly, they serve as a
starting point to uncover and analyze intertextual reference structures in more detail
in subsequent applications. These could be analyses according to subject areas or
discourses (the parliamentary-procedural, the economic, the academic), or according
to types of reference objects (written text types, oral utterances), or to the relation
of references to party affiliation, for example. Secondly, the study of communicative
practices in parliaments is fundamentally relevant for understanding the mechanisms
of Western parliamentary democracies. The analytical annotation and automated recognition
of different types of such practices are important prerequisites for the further investigation
of their mutual interaction in different contexts. And thirdly, categorizing practices
of discourse referencing is methodologically interesting for digital pragmalinguistics,
because it addresses a fundamental challenge in the field: On the one hand, pragmalinguistic
phenomena can be indicated on the linguistic surface and thus be recognized, also
in an automated way, but on the other hand, the capture of implicit and inferred aspects
as well as the inclusion of contextual knowledge are of central importance. This makes
interpretative analysis indispensable and requires the training of algorithms through
manual annotation. [
Archer et al. 2008, 615] have pointed out this particular difficulty for annotation studies with an
automation perspective.
In linguistic heuristics, discourse referencing belongs to pragmatics because it involves
linguistic practices whose function can only be inferred based on contextual knowledge.
This may not seem to be evident at first glance. Some forms of discourse referencing
are easily detectable on the linguistic surface. Consider, for example, explicitly
marked quotations or communication verbs (such as
say or
promise). However, discourse referencing can also be indicated implicitly. Formulations,
such as
“With your behavior […] you have placed yourself in an improper proximity […]” (
1) in the following example, require interpretation based on contextual knowledge to
be identified as practices of discourse referencing. Here, an interpretative effort
would lead to understanding
“behavior” as a linguistic action rather than, say, physically violent behavior:
1. With your behavior [...] you have placed yourself in improper proximity to your neighbors
here further to the right. (all examples are translated by the authors)
[Sie haben sich mit Ihrem Verhalten […] in eine ungute Nähe zu Ihren Nachbarinnen
und Nachbarn hier weiter rechts begeben.]
Such contextual and interpretative phenomena cannot be simply captured by corpus linguistic
or algorithmic access to the linguistic surface, which makes them difficult to analyze
in an automated way: While linguistic surface patterns (e.g., word order, collocations,
word frequencies or the distributions of words or larger linguistic constructions)
can be detected easily, their exact meaning and pragmatic function may not be fully
captured on this level by machines due to the missing context knowledge. One approach
to solving this problem is to combine interpretive-categorizing annotation and machine
learning. The application of this methodological approach to the subject of discourse
referencing has so far been a research desideratum, even more so with a focus on category
development with automatability in mind.
While discourse referencing as our linguistic research object is important in its
own right for understanding the mechanism of parliamentary discourse, here we focus
on the methodological aspect of category development concerning the automated detection
of such references in large datasets. For this purpose, we conduct a collaborative
annotation study and run experiments with probabilistic classifiers such as Naive
Bayes [
Jurafsky and Martin 2022] and transformer language models such as BERT [
Devlin et al. 2019]. As part of this study, we methodologically describe and discuss the development
of an annotation category system on the object of discourse referencing with automation
possibilities.
We obtain the dataset for our case study from the linguistically preprocessed corpus
of the plenary minutes of the German Bundestag ([
Müller and Stegmeier 2021], cf. [
Müller 2022b]).
The category system combines deductive and inductive categorizations. In a first step,
we form categories for discourse referencing that stem from linguistic theories. In
a second step, we have to adapt these or create new categories for forms and cases
that we only recognize in the course of data exploration – especially for cases of
implicit discourse referencing, such as “you have placed yourself […],” and others. The central challenge with this approach is to capture the phenomena
under investigation as precisely as possible and at the same time to maintain a certain
balance of granularity and variance in the category contents.
In the following, we first provide an overview of preliminary work on category design
in pragmalinguistics and linguistic discourse research. We focus on already-known
success factors in the formal and contextual tailoring of categories. Next, we introduce
the pragmatic phenomenon of discourse referencing and describe the properties that
are relevant to our heuristic model building. Subsequently, we describe and discuss
our dataset and the collaborative annotation of discourse referencing practices in
terms of assumptions, process and results. The annotation process consists of two
phases: 1. We test the aspect of categorization granularity by modelling a binary
classification task (discourse referencing present or not). 2. We tag our data in
a more fine-grained way, focusing on the actors (authors/speakers) of referenced utterances
(actors mentioned or not), and additionally extracting phrases that have been identified
to indicate discourse referencing. In addition to this, we run linguistic experiments
using probabilistic and neural classifiers to detect discourse referencing. In this
set of experiments, we test the influence of different input data in terms of taxonomies
(number of categories) and segment sizes (phrase input vs. sentence input). By doing
so, we also investigate the interplay between form and meaning. We analyze its impact
on both algorithmic models and collaborative manual annotation: Does annotating smaller
segment sizes, which are more specific to the phenomenon under investigation, or entire
sentences containing the phenomenon, align better with the content-conceptual granularity
of the category in question? Finally, we discuss our results on the question of category
design and conclude with a summary.
2. Capturing discourse referencing by annotation
2.1 Criteria for the development of machine-learnable categories in a pragmalinguistic
annotation approach
Numerous issues, aspects and criteria for the development of category systems have
been discussed in the literature on pragmalinguistic annotation. [
Archer et al. 2008, 615] differentiate five levels of pragmatic information relevant to category development:
the formal, the illocutionary, the implied/inferred, the interactional and the contextual
level. The consistent consideration of these level differences is seen as an important
criterion for the design of annotation schemes. In particular, [
Archer et al. 2008, 633] highlight segmentation:
“Segmentation requires us not only to state what unit we will be analysing, but also
to define it in a way that will enable us to measure one unit against another, and,
by so doing, ensure a level of consistency.” Segmentation thus refers to the size of annotated units on the linguistic surface
(e.g., phrases or sentences) chosen according to the conception of the category system.
This aspect has been described as an important quality criterion in other works in
the field as well, e.g., in the annotation of speech acts (cf. [Leech and Weisser 2003]). Teufel also addresses the segmentation problem, from a more computational-linguistic point of view, reflecting on the difficulty of assigning abstract categories to linguistic units. She also addresses the problem that categories can overlap but is
critical of multiple annotations with regard to evaluability (cf. [
Teufel 1999, 108]). Instead, she opts for selective annotation with exclusive categories and
consistent segmentation ([
Teufel 1999, 111]; cf. [
Weisser 2018, 213–277]).
These aspects – consistent segmentation and a distinctive category system – have likewise
proven crucial in our previous studies on pragmalinguistic annotation, also concerning
the combination of pragmatic annotation and machine learning. In addition to these
two aspects, we have worked out the factors of granularity of categories and context
sensitivity/depth of interpretation in prior studies ([
Becker et al. 2020]; [
Bender 2023]). To give an example of different category granularities in a system: In [
Becker et al. 2020], we treated discourse referencing as a subcategory of relevance marking and further distinguished, at a more fine-grained level, between directed and undirected discourse referencing, thus working with three levels of granularity in one category. We developed a complex annotation
scheme with pragmalinguistic categories at different levels of granularity to study
academic text routines (e.g., relevance marking, definition, argumentation). We used
this scheme to manually annotate sentences in a corpus of texts from different academic
disciplines and then to train a recurrent neural network for classifying text routines.
The experiments showed that the annotation categories are robust enough to be recognized
by the model, which learns similarities between sentence surfaces represented as vectors.
Nevertheless, the accuracy of the model depended strongly on the granularity of the
category level [
Becker et al. 2020, 450–455].
In general, pragmalinguistic questions raise the challenge of operationalizing and
segmenting phenomena that are context-dependent rather than bound to a formal segment.
In a great number of cases, discourse referencing acts can be delimited to certain
phrases. However, there are cases – e.g., certain anaphoric references – where the
indicators of discourse referencing can only be fully captured in the extended cotext,
i.e., the surrounding sentences/utterances at a definable distance from the focused
utterance – as opposed to context as extra-linguistic, e.g., social and situational
conditions and knowledge backgrounds. Thus, in addition to the aspect of segmentation
consistency, the granularity of segmentation and the size of the cotext window are
also important.
Both granularity and distinctiveness are relevant factors for the segmentation and
also for the robustness of the category system as a whole. Granularity determines
the semantic and pragmatic content of the categories in annotation schemes. The granularity
of the tagset influences the accuracy of the algorithm (cf. [
Becker et al. 2020, 455]). This does not mean that schemes with few categories or tags are always better.
Rather, it is important, firstly, to capture a given phenomenon as precisely as possible through the operationalization in the scheme and thus make it analyzable in the first place. Secondly, insufficiently differentiated tagsets lead to overly heterogeneous categories, which in turn limit machine learnability.
The annotation guidelines need to consider this. For instance, they need to specify
exactly how much communicative and contextual knowledge may be included and how deeply
it is to be interpreted to determine whether an utterance is a reference to a communicative
act – even in cases where this is not made explicit through corresponding lexis (see example in the introduction).
To achieve agreement in the annotation process, the team of annotators must reach
explicit common ground on the depth of interpretation when assigning segments to categories.
The more cotext/context is available to annotators, the more they will interpretatively
work out what was
“actually” meant by a sentence, and the higher the risk that annotators will disagree. Therefore,
it may be useful to deliberately limit the co-textual information and thus limit the
depth of interpretation. Categories designed to be distinctive (allowing no overlap
of categories) and exhaustive (covering the whole variety of phenomena in the data)
have proven to optimize machine learning [
Becker et al. 2020, 430]. This robustness can be evaluated by calculating the inter-annotator agreement
[
Artstein and Poesio 2008, 555–596]. The above-mentioned factors also represent quality criteria for the explicitness
and intersubjective comprehensibility of interpretative categorizations in annotation
studies, i.e., they determine whether categorizations are compatible with machine
learning, for one, and comprehensible for human addressees, such as other annotators
or recipients of the respective study, for another. Besides this, the accuracy values
of the different algorithmic models we will test represent verification results.
In summary, our category development considers the factors of segmentation, granularity,
distinctiveness and context sensitivity/depth of interpretation on different levels
as well as in their mutual interaction with the machine learnability of the category
system in experiments. In this study, we draw on these findings and test the effects
of changes in these factors as well as their impact in various experiments (on the
inter-annotator agreement and the learning success of different algorithmic models).
Furthermore, we test whether the trained algorithmic models cope better with sentence
segmentation or with phrase-level segmentation.
2.2 Linguistic routines of discourse referencing
By discourse referencing, we mean referring to a preceding communicative act within
discourse ([
Müller 2007, 261]; [
Feilke 2012, 19]). We are thus dealing with particular cases of intertextuality [
Allen 2000]. These are characterized by concrete and explicit references to other communicative products (“communicates”),
which can be called
“texts” in a broad sense. They include not only pre-texts such as laws, templates, drafts,
and policy papers but also oral utterances. In all cases, the referenced act is in
the past from the speaker’s point of view. The reference can be uttered as a complete
proposition, as a verbal phrase (VP), or as a noun phrase (NP) (see examples in
section 3.2), with the subject of the utterance fully named, metonymically named, or without
naming the subject of the utterance. Discourse referents in this sense are constitutive
of many genres, e.g., academic or legal discourse.
Communicative practices in parliaments are fundamentally relevant for understanding
the mechanisms of Western parliamentary democracies. But discourse references in parliamentary
discourse also have functions that are interesting in terms of linguistic systematics:
First, they serve to orient and co-orient political statements in different discourses
(citation
2; e.g., the parliamentary-procedural, the economic, the academic); second, they are
used to index institutional and situational coalitions or oppositions (
3); and third, they are used to invoke the legal basis of parliamentary action (
4; laws, directives, regulations). In the sense of this last point, discourse references
serve to recall the distinguished function of the parliamentary arena as a laboratory
in which the legal framework of our social life is forged.
2. Those who say this are subject to an essential misjudgment because they do not know
or misjudge what great added value it means in terms of acceptance and industrial
peace when important decisions are discussed beforehand in the works council and then
implemented together in the company.
[Die, die das äußern, unterliegen einer wesentlichen Fehleinschätzung; denn sie wissen
nicht oder schätzen falsch ein, welchen großen Mehrwert es im Hinblick auf Akzeptanz
und Betriebsfrieden bedeutet, wenn wichtige Entscheidungen zuvor im Betriebsrat besprochen
und dann gemeinsam im Betrieb umgesetzt werden.]
3. The suitable and also still possible minimally invasive solution in the remaining weeks is an opening of the contribution guarantee, which, also according to the opinion of science, leads to more net yield and more security.
[Die passende und auch noch mögliche minimalinvasive Lösung in den verbleibenden Wochen
ist eine Öffnung der Beitragsgarantie, die auch nach Meinung der Wissenschaft zu mehr
Rendite und mehr Sicherheit führt.]
4. Please read the act first, before you argue in a populist way here.
[Lesen Sie doch bitte erst das Gesetz, bevor Sie hier populistisch argumentieren.]
One can see from these first examples that the focus and concreteness of the intertextual
reference vary considerably. (
2) contains a reference to a concrete and theoretically precisely determinable group
of speakers antecedent in the discourse, but introduced into the discourse only unspecifically
(
those who say this). In (
3), there is a similarly unspecific reference that is introduced with a metonymic shift
(
according to the opinion of science instead of
“according to the opinion of some academic scholars who are concerned with this issue”). In (
4), a legal statute is referred to as the manifest result of a communicative act, without
addressing the actors involved in the writing of the statute at all. Such a reference
to texts as instances independent of the author, as it were autonomously effective,
is a common rhetorical procedure in parliamentary debates.
In other cases, of course, utterances refer to concrete empirical persons. These can
be groups (see example
5), or individuals (
6). Besides this, there are (albeit rare) cases in which reference is made to a preceding
text in the discourse, such that the text itself takes the place of the actor in a
communicative action (
7). These metonymic shifts are interesting because they give a different hue to the
action structure of the discourse that is being produced using discourse referencing:
the cognitive focus, the claim of validity, and also the authority are shifted from
the author to the text in such cases. Methodologically, what is interesting here is
the extent to which such metonymic constructions can be found automatically, especially
since they are rare.
5. After all, the concern of the democratic opposition groups is a correct one.
[Denn das Anliegen der demokratischen Oppositionsfraktionen ist ja ein richtiges.]
6. Ladies and gentlemen, Kohl, a historian by training, once said: “Those who do not know the past cannot understand the present and cannot shape the
future.”
[Meine Damen und Herren, der gelernte Historiker Kohl hat einmal gesagt: “Wer die Vergangenheit nicht kennt, kann die Gegenwart nicht verstehen und die Zukunft
nicht gestalten.”]
7. The report confirms: Inner cities are losing their individuality and thus their attractiveness.
[Der Bericht bestätigt: Die Innenstädte verlieren ihre Individualität und damit Attraktivität.]
We exemplify our methodological considerations and experiments on category design
with the following research questions: 1. Which types of discourse referents occur
in our data set and in which distribution? 2. What role do actors play in discourse
referencing? That is, when are the speakers and writers of utterances explicitly named,
and when, instead, in a metonymic thrust, does the text itself move into the position
of the actor (as in example
7)?
3. Dataset and annotation workflow
3.1 Dataset
To investigate discourse referencing in parliamentary discourse, we draw on the plenary
minutes of the German Bundestag [
Müller and Stegmeier 2021]. Discourse Lab [
Müller 2022a] hosts a linguistically processed and metadata-enriched corpus of the plenary minutes
that currently covers the period from 1949 to May 2021, i.e., all completed election
periods from 1 to 19. The corpus contains about 810,000 texts (debate contributions)
and about 260 million tokens. It is expanded at regular intervals with current data
[
Müller 2022b] which is provided by the German Bundestag (https://www.bundestag.de/services/opendata).
Pre-processing includes tokenization, sentence segmentation, lemmatization, part-of-speech
tagging, marking of speakers’ party affiliation, and separate marking of heckling.
This way, speeches with and without heckling, or even the heckling on its own, can be searched.
The basic unit (
<text>) of the corpus is the parliamentary speech. It is subclassified by speakers’ texts
<sp> and heckling
<z>. Text attributes are the speaker’s parliamentary group (fraction), year, month, speaker, session, legislative
period, text ID and day of the week. The corpus is managed via the IMS Corpus Workbench
[
Evert and Hardie 2011]. For our categorization experiment, we draw a random sample of 6,000 sentences from
the May 5–7, 2021 plenary transcripts. We exclude heckling in the process. The sample
is homogeneous across time and actors: Since our study is about methodological experiments
on category formation, the variation of parameters should be controlled. With the
sample design, we exclude diachronic variation and variation caused by changing groups
of actors. We include various types of discourse referencing in that our dataset covers
functional, thematic, and interpersonal variation.
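To make the sampling step concrete, the following minimal sketch shows how such a sample could be drawn, assuming the preprocessed sentences of the selected sittings are available as Python dicts with an is_heckling flag (the data structure and field names are hypothetical):

```python
import random

def draw_sample(sentences, n=6000, seed=42):
    """Draw a reproducible random sample of speech sentences, excluding heckling (<z>)."""
    speech_sentences = [s for s in sentences if not s["is_heckling"]]
    random.seed(seed)
    return random.sample(speech_sentences, n)
```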
3.2 Collaborative annotation
The first part of our experimental annotation study on discourse referencing focuses
on collaborative manual annotation. We consider collaborative annotation to mean not
only that several annotators assign categories, but also that categories and guidelines
are developed in a team (cf. [
Bender and Müller 2020]; [
Bender 2020]). The understanding of discourse referencing described in section 2.2 requires linguistic
expertise – at least in less explicit cases. Thus, we cannot simply assume everyday
linguistic intuition to be sufficient but must develop criteria and guidelines and
make them available to annotators, or at least train them to some extent in the application
of the guidelines. Of course, it is best to involve all annotators in the development
of the categories as well, if possible. We have been able to do this, at least in
part, in the study described here. For this purpose, we discussed the theoretical
concept of categorization in the team and, on this basis, first established criteria
for assigning categories to segments.
The basic unit of annotation was set to be sentences. The reason for this is that
linguistic actions are typically represented in sentences. Co-textual information
was intentionally narrowed down in this study by extracting individual sentences and
making them available to annotators in random order. Within this cotext window, not
all discourse referencing can be fully resolved even in terms of unambiguous attribution
to prior utterances, but the indicators of discourse referencing can be detected at
the individual sentence level by context knowledge/language knowledge (without further
cotext). In this respect, the unit sentence, which can also be delimited and quantified
for algorithmic models, was given preference here over, for example, freely selectable
text sections as larger cotext windows. No overlap of categories was allowed in the
annotation. The next smaller unit in the linguistic system is phrases, which were
used in this case for the extraction of classification-relevant indicators. Evident
indicators of discourse referencing are phrases with communication verbs and noun
phrases that introduce sources of referenced utterances (i.e., authors, speakers).
Other – context-sensitive – indicators were identified in the collaborative data analysis
in the course of pilot annotations. For example, discourse references in parliamentary
discourse are also made with action verbs in conjunction with nominal mentions of
texts or utterances (e.g., “with the draft we initiated the debate”).
After determining relevant categories deductively, trial annotations were carried
out. The category system was revised inductively in a data-driven manner and team
members discussed cases of doubt. The abductive differentiation or reconfiguration
of the scheme is necessary when the assignment of text segments ([Peirce 1903] calls it “percept”) to categories (“percipuum,” [Peirce 1903]) by qualitative induction fails in the course of annotation. In our annotation process,
however, we understand this new construction or configuration not as a result of purely
individual insights, but as a collaborative-discursive process of negotiating categories
that are plausible for all annotators.
An additional goal that made this collaborative discursive negotiation process even
more complex was to combine a linguistic analysis perspective with computational linguistic
expertise to better anticipate what different machine learning algorithms can capture.
For example, we decided against annotating verbatim quotations and indirect speech
because we wanted to train the algorithmic models primarily on indicators which show
that referencing is taking place, instead of focusing on what is being referenced.
After all, the formation of linguistic routines occurs at the level of referencing,
while what is referenced can vary indefinitely. Since we aim to discuss the question
of category design at the intersection of disciplinary heuristics and machine learning,
we developed an annotation workflow that allows us to conduct machine learning experiments
on categories of varying complexity in terms of form and content.
We decided on different levels of annotation complexity for which we developed the
appropriate categories:
Annotation step | Complexity level | Category | Segment | Classification decision | Possible numbers of segments per instance
1 | 1 | discourse referencing | sentence | yes/no | 1
2a | 2 | mention of the source (author/speaker) of the referenced utterance | sentence | explicit/metonymic/none | 1
2b | 3 | discourse referencing | phrase | yes/no | n

Table 1. Manual annotation – workflow.
Table 1 presents the different annotation steps, which are designed according to increasing
complexity: Step 1 is a binary classification task with two labels. In step 2, we
ran two annotation tasks at the same time. First, different types of thematization
of authors/speakers of the textual and oral utterances were classified – at the sentence
level: explicit/metonymic/none. Second, within the sentences that were already classified
as discourse referencing, those phrases that were relevant to the classification decision
were identified (see
Table 2). This step requires accurate annotation of phrases representing relevant actors,
actions and products. Even though step 2b is a binary classification task, the decisions
required for classification are even more complex because any number of segments can
be annotated for each instance and the three-item classification from step 2a is presupposed.
The first annotation phase consisted of a binary classification task that required
distinguishing between sentences with and without discourse referencing. According
to this criterion, all 6,000 sentences of the corpus sample were double annotated
(sentences as segments). Teams of two performed the annotation of 3,000 sentences
each in Excel spreadsheets independently. The sentences were arranged in random order
to avoid possible cotext/context effects. After double annotation, the inter-annotator
agreement was calculated based on Cohen’s kappa [
Cohen 1960]. Agreement scores varied among groups in the first run. In group 1, 2,566 of 2,919
sentences were annotated in agreement (88%, Cohen’s kappa: 72.87), in group 2, 2,408
of 2,883 (83.5%, Cohen’s kappa: 57.44). The difference in kappa score between the
groups is linked to the fact that in group 2 the rarer label (
“+ discourse referencing”) was assigned less frequently in agreement (in 487 cases), due to a misunderstanding
that became apparent late in the annotation process. This had a disproportionately
large impact on the calculation of agreement statistics using Cohen’s kappa. This is because infrequent labels are calculated to have a lower probability of being assigned in agreement by random chance than high-frequency ones, so disagreements on them weigh more heavily. Cohen’s kappa is designed to compute the chance-corrected matches of annotations from different annotators. In this way, it expresses a ratio between the expected chance agreement and the observed agreement, assuming that annotators can also assign the same label to an instance
by random chance with a certain probability (cf. [
Greve and Wentura 1997, 111]; [
Zinsmeister et al. 2008, 765f]).
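As an illustration of this chance correction, the following minimal sketch computes raw agreement and Cohen’s kappa for two annotators’ binary decisions, using the scikit-learn implementation and hypothetical toy data; the kappa value is multiplied by 100 to match the reporting convention used here:

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical toy annotations: 1 = "+ discourse referencing", 0 = "- discourse referencing"
annotator_1 = [1, 0, 0, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 0, 0, 1, 0]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"raw agreement: {raw_agreement:.1%}, Cohen's kappa: {kappa * 100:.2f}")
```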
The average agreement score was nevertheless acceptable (Cohen’s kappa: 65.02). Kappa
scores are evaluated differently in the literature. [
Greve and Wentura 1997] categorize kappa scores above 75 as excellent, and scores between 61 and 75 as good.
In more recent NLP work, even lower values are accepted as good (e.g., [
Ravenscroft et al. 2016]; cf. [
Becker et al. 2020, 442]). Based on this assessment of the kappa value and the otherwise high degree of agreement between the annotations, the results of phase one were accepted as the basis
for the second phase. That is, all cases in which different categories were assigned
were filtered out. These cases were then decided by an independent annotator according
to the criteria of the guidelines. 1,935 of 6,000 sentences (32.25%) were identified
as discourse referencing, which indicates the importance of such practices in parliamentary
discourse.
In the second annotation phase, these 1,935 sentences were annotated according to
a more fine-grained scheme: The classification task was to distinguish discourse references
in which the actor (author/speaker) of the referenced utterance is explicitly named
from those in which the text becomes the actor in a metonymic shift and those in which
no actors are named (see
Table 2).
Tag | Description | Example
1 | Actor explicitly mentioned. | Twelve years ago, the Chancellor, together with the prime ministers of the time, proclaimed the “7 per cent” goal. [Die Kanzlerin hat gemeinsam mit den damaligen Ministerpräsidentinnen und Ministerpräsidenten vor zwölf Jahren das Ziel “7 Prozent” ausgerufen.]
2 | Metonymic mention of the actor. | Our Basic Constitutional Law obligates us to create equal living conditions in Germany. [Unser Grundgesetz verpflichtet uns zur Schaffung gleichwertiger Lebensbedingungen in Deutschland.]
3 | No actor mentioned. | The recommended resolution is adopted. [Die Beschlussempfehlung ist angenommen.]

Table 2. Tagset of the second annotation round.
As a result, we measured very good agreement (Cohen’s kappa: 84.35). After curating
the annotations and producing the gold standard, 721 sentences (37.26%) were assigned
to category 3, 1,155 (59.69%) to category 1, and 59 (3.05%) to category 2.
In the same step, we extracted the phrases that had been rated by the annotators as
crucial for the categorization as discourse referencing. These included, for example,
noun phrases (NP) representing communicative acts or texts or discourse actors (without
heads of embedding phrases such as prepositions in a prepositional phrase) or relevant
verb phrases (VP) (including verbs that express communicative action, as shown in
the examples) without complements and adverbials.
Table 3 gives an example of phrase extraction.
Categorized sentence | Extracted phrases critical to categorization | Phrase type | Referenced
For me, there are three good reasons to reject this proposal of the AfD today: The first is the sheer thin scope already mentioned by colleague Movassat; I do not need to say much more about it. [Für mich gibt es drei gute Gründe, diesen Antrag der AfD heute abzulehnen: Der erste ist der schon vom Kollegen Movassat erwähnte schiere dünne Umfang; dazu brauche ich nicht mehr viel zu sagen.] | this proposal of the AfD [diesen Antrag der AfD] | NP | text
 | colleague Movassat [Kollege Movassat] | NP | actor
 | mentioned [erwähnte] | VP | utterance

Table 3. Manual phrase extraction from sentences categorized as “discourse referencing.”
The phrases “to reject” [abzulehnen] and “I do not need to say” [brauche ich nicht … zu sagen] were not extracted because they represent possible
future utterance acts, not preceding ones.
This extraction was intended to work out what annotators are looking at when they
detect discourse referencing. In machine learning, an “attention mechanism” is used to try to mimic human cognitive attention. The extraction of relevant phrases is used to test whether this principle can be supported in this way; the effects that can be observed are discussed in the next section.
4. Automatic classification/machine learning
In this section, we describe how we build and apply different machine learning algorithms
to detect and classify discourse references in political debates. The goal of this
research is to assess the ability of computational models such as traditional classification
algorithms as well as Deep Learning techniques to detect discourse references in texts
and to classify them.
4.1 Task Description
As mentioned before, we developed our category scheme with regard to the machine learnability
of the different labels and paid particular attention to the factors segmentation,
granularity, distinctiveness and context sensitivity. In line with the two phases
of annotations, as described above, we designed two tasks for probing the ability
of computational models to learn our category system:
- Task 1: Detecting discourse references. In the first annotation phase, our annotators had to distinguish sentences with discourse
referencing from sentences without discourse referencing. For computational modelling,
this can be framed as a binary classification task, namely detecting discourse references in texts at the sentence level: Given a sentence, the task is to predict
if this sentence contains a discourse reference or not. We use each of the given 6,000
sentences as input, and let the model predict for each of them one of the two labels
discourse referencing (1) and no discourse referencing (0).
- Task 2: Classifying types of discourse references. The second task is to classify the discourse references into three categories: Actor explicitly mentioned, Metonymic mention of the actor and No actor mentioned (see Table 4). We use all instances that have been annotated as discourse references in the gold
version of our annotations for training and testing our models (n=1,935). We experiment
with three different input formats: providing the model (a) with the full sentence
as input, (b) only with the phrase marked as relevant for discourse referencing as
input, and (c) with both the full sentence and the marked phrase, concatenated and separated by a separator token.
We then train and evaluate the models in three settings. In the first setting A, all three categories are taken into account. In the second setting B, the least
frequently assigned category metonymy is excluded. The idea behind that is that most
machine learning approaches suffer from imbalanced datasets and in particular from minority classes that are represented by too few examples. With setting B, we can therefore test how much the small size of our minority class metonymy affects
our results. In the third setting C, we finally combine categories 1 and 2, which are both actor-naming categories, and
contrast them with category 3, in which no actors are mentioned. In this way, we can
reveal if our models can distinguish between instances that focus on the actors, and
instances that leave the actors implicit.
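The following sketch illustrates how the three input formats and the three settings described above could be derived from a single annotated instance; the function and field names are hypothetical, and “[SEP]” stands for the tokenizer’s separator token:

```python
def build_inputs(sentence, phrase):
    """Return the three input formats used in Task 2 (a: sentence, b: phrase, c: both)."""
    return {
        "sentence": sentence,
        "phrase": phrase,
        "sentence_plus_phrase": f"{sentence} [SEP] {phrase}",
    }

def map_label(label, setting):
    """Map the gold label (1 = explicit actor, 2 = metonymic, 3 = no actor) to a setting."""
    if setting == "A":                       # all three categories
        return label
    if setting == "B":                       # drop the rare metonymy class
        return None if label == 2 else label
    if setting == "C":                       # merge the two actor-naming categories
        return "actor-naming" if label in (1, 2) else "no actor"
    raise ValueError(f"unknown setting: {setting}")
```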
4.2 Description of Models
To investigate to what extent our category system as described above can be learned by machine learning techniques, we test two different supervised machine
learning approaches: (I) Naive Bayes, a traditional classification algorithm, serves
as our baseline model and is compared to (II) BERT, a State-of-the-Art Transformer
Language Model that has shown great success in various NLP tasks. Both models are
applied to detect (Task 1) and classify (Task 2) discourse references in texts.
Baseline Model – Naive Bayes. Naive Bayes is a probabilistic classifier that makes simplifying (“naive”) independence assumptions about the interaction
of features [
Jurafsky and Martin 2022, 59]. The text is treated
“as if it were a bag-of-words, that is, an unordered set of words with their position
ignored, keeping only their frequency in the document.” [
Jurafsky and Martin 2022]. This means that first, the occurrences of words in each category are counted (“bag-of-words”). Then, for each word, the probability that it occurs in each category can be calculated. For each new observation, a probability value is then calculated for each category: it is first assumed that the sentence belongs to category 1, and the prior probability of category 1 is combined with the probabilities of each word occurring in category 1 (in practice, log probabilities are summed). In the next step, the same calculation is performed assuming the new observation belongs to category 2. After calculating these values for each category, the values are compared with each other. The category with the highest value is the prediction of the classifier.
For our approach, we use the Multinomial Naive Bayes model as implemented in the Python
package scikit-learn [
Pedregosa et al. 2011]. We use 90% of the data for training and keep 10% for testing.
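A minimal sketch of this baseline, assuming the annotated sentences and their binary labels are available as Python lists (the placeholder examples below stand in for the 6,000 gold sentences and are taken from the translated examples above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# placeholder data standing in for the annotated corpus sample
sentences = [
    "Please read the act first, before you argue in a populist way here.",
    "The report confirms: Inner cities are losing their individuality and thus their attractiveness.",
    "The Highway GmbH of the federal state examines the facts of the case.",
    "Thank you very much for your attention.",
]
labels = [1, 1, 0, 0]   # 1 = discourse referencing, 0 = no discourse referencing

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.1, random_state=42)   # 90% training, 10% testing

model = make_pipeline(CountVectorizer(), MultinomialNB())  # bag-of-words counts + Naive Bayes
model.fit(X_train, y_train)
print(model.predict(X_test))
```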
Transformer Language Model – BERT. The application of pre-trained language models, such as BERT [
Devlin et al. 2019], GPT [
Radford et al. 2019] or XLNet [
Yang et al. 2020], has recently shown great success and led to improvements for various downstream
NLP tasks. Through pre-training on large textual corpora, these models store vast
amounts of latent linguistic knowledge ([
Peters et al. 2018]; [
Orbach and Goldberg 2020]). After pre-training, the models can be fine-tuned on specific tasks with a small
labelled dataset and a minimal set of new parameters to learn.
Language models have been successfully applied to various language classification
tasks, such as emotion classification [
Schmidt et al. 2021], sentiment analysis [
Yin and Chang 2020], and relation classifications [
Becker et al. 2021]. Inspired by these insights, we make use of the latent knowledge embodied in large-scale
pre-trained language models and explore how we can finetune them for our two classification
tasks – the detection of sentences with discourse referencing and the classification
of different types of discourse references.
Initial experiments with different models had shown that the transformer language
model BERT [
Devlin et al. 2019], which is pre-trained on the BooksCorpus and English Wikipedia (3.3 billion words in total), yields the best performance for our two tasks. For efficient computing
and robustness, we use the distilled version of BERT, DistilBERT [
Sanh et al. 2019], for our experiments. DistilBERT uses the so-called knowledge distillation technique
which compresses a large model, called the teacher (here: BERT), into a smaller model,
called the student (here: DistilBERT). The student is trained to reproduce the behavior
of the teacher by matching the output distribution. As a result, DistilBERT is 60%
faster than the original BERT and requires less computing capacity, while retaining
almost its full performance.
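The soft-target part of this distillation objective can be sketched as follows (a schematic PyTorch snippet under the assumption that teacher and student logits for the same batch are given; DistilBERT’s full training objective additionally combines a masked-language-modelling loss and a cosine embedding loss):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student to match the teacher's softened output distribution."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

# toy example with random logits for a batch of 2 items and an output space of 5 entries
loss = distillation_loss(torch.randn(2, 5), torch.randn(2, 5))
print(loss.item())
```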
DistilBERT – as well as its teacher BERT – makes use of the Transformer architecture, which relies on a multi-head attention mechanism that learns relations between words in a text. In contrast to other language
models that process a text sequence from left to right, DistilBERT applies bidirectional
training, which means that during training, it reads the entire sequence of words
at once. More specifically, during training the model is provided with sentences where
some words are missing. The task for the model is then to predict the missing (masked)
words based on their given context. By learning to predict missing words, the model
learns about the structure and semantics of a language during the training phase,
which leads to a deeper sense of language context.
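This masked-word prediction can be tried out directly with a pre-trained DistilBERT checkpoint; the English model below is chosen only for illustration, since the exact checkpoint is not specified here:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
for prediction in fill_mask("Please read the [MASK] first, before you argue here."):
    print(prediction["token_str"], round(prediction["score"], 3))
```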
For our experiments, we use the pre-trained DistilBERT model from HuggingFace Transformers
[
Wolf et al. 2020] and finetune it on our labelled training data. We use 70% of the
data for training and keep 15% for validation and testing, respectively. We optimize
the model parameters and configurations on the validation set and report results for
the test set. The optimal hyperparameters for our two classification tasks are displayed
in
Table 4. As our output layer we use softmax. This function enables us to interpret the output
vectors of the last layer from the model as probabilities, by mapping them to values
between 0 and 1 that all add up to 1.
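In isolation, the softmax function looks as follows (an illustrative NumPy sketch; the logits are made-up numbers):

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))   # subtract the maximum for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.1, -0.3])))        # -> approximately [0.917, 0.083]
```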
Hyperparameter | Task 1 | Task 2
Number of training epochs | 4 | 4
Batch size | 16 | 4
Learning rate | 5e-5 | 5e-5

Table 4. Hyperparameter setting for DistilBERT.
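A condensed sketch of this fine-tuning setup with the Task 1 hyperparameters from Table 4, assuming the gold sentences and labels are available as lists (the checkpoint name and the placeholder data are assumptions; a German checkpoint would be a natural choice for the Bundestag data):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# placeholder data standing in for the annotated sentences (1 = discourse referencing)
sentences = ["Please read the act first, before you argue in a populist way here.",
             "Thank you very much for your attention."] * 20
labels = [1, 0] * 20

checkpoint = "distilbert-base-uncased"   # assumption; the exact checkpoint is not specified
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

ds = Dataset.from_dict({"text": sentences, "label": labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
            batched=True)
split = ds.train_test_split(test_size=0.3, seed=42)                 # 70% training
val_test = split["test"].train_test_split(test_size=0.5, seed=42)   # 15% validation, 15% test

args = TrainingArguments(
    output_dir="distilbert-discourse-referencing",
    num_train_epochs=4,              # hyperparameters from Table 4 (Task 1)
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=split["train"], eval_dataset=val_test["train"])
trainer.train()
predictions = trainer.predict(val_test["test"])   # evaluate on the held-out test split
```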
4.3 Results
For both tasks, when evaluating the two models, respectively, we compare the predicted
labels to the gold version of our annotations. We report results on the test sets
and use the evaluation metrics Precision, Recall and F1 (we report all scores as micro
scores, which means they are weighted according to the label distribution).
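The evaluation step can be sketched as follows, assuming gold and predicted labels are available as lists (hypothetical toy values):

```python
from sklearn.metrics import precision_recall_fscore_support

gold = [1, 0, 1, 1, 0]        # hypothetical gold labels
predicted = [1, 0, 0, 1, 0]   # hypothetical model predictions

precision, recall, f1, _ = precision_recall_fscore_support(gold, predicted, average="micro")
print(f"Prec: {precision:.2%}  Rec: {recall:.2%}  F1: {f1:.2%}")
```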
Model | Input | Prec | Rec | F1
Naive Bayes | Sentence | 80.98 | 79.84 | 80.30
DistilBERT | Sentence | 93.17 | 93.15 | 93.16

Table 5. Results for Task 1: Binary classification between discourse referencing and no discourse referencing.
Table 5 displays the results for our first task – which was, given a sentence, predict if
this sentence contains a discourse reference or not. We find that both models – Naive
Bayes and DistilBERT – outperform the majority baseline (64.48% for Label 0, No Discourse
referencing) significantly. DistilBERT outperforms our baseline model Naive Bayes
by 13 percentage points (F1 score), which matches our expectations that the latent
linguistic knowledge that DistilBERT stores through its pre-training on large corpora
can successfully be utilized for the task of detecting discourse references in political
debates.
Model | Input | Prec | Rec | F1
Naive Bayes | Sentence | 80.55 | 80.86 | 78.81
 | Phrase | 82.04 | 82.34 | 80.34
 | Sent + phrase | 83.06 | 83.03 | 81.45
DistilBERT | Sentence | 92.44 | 92.44 | 92.41
 | Phrase | 97.08 | 96.79 | 96.90
 | Sent + phrase | 96.13 | 95.88 | 95.98

Table 6. Results for Task 2, Setting A: Classifying types of discourse references, three classes: “Actor explicitly mentioned” vs. “Metonymic mention of the actor” vs. “No actor mentioned.”
Tables 6–8 display the results of our second task – which was to classify the discourse
references into different categories. The results for Setting A in which we distinguish
the three categories
Actor explicitly mentioned, Metonymic mention of the actor and
No actor mentioned are shown in
Table 6. Both models outperform the majority baseline (59.69% for the label
Actor explicitly mentioned) significantly. For both models, we find that providing the model with relevant phrases
instead of or in addition to complete sentences improves the model’s performance.
The best results for the Naive Bayes model are obtained by combining the sentence
with the relevant phrase as input to the model, while DistilBERT learns best when
provided only with the relevant phrase. This indicates that the models are not always
fully able to detect which parts of the sentences are relevant for classifying types
of discourse references and can benefit from that information when provided with it
as input.
When comparing the scores of the best input formats for each model, again we find
that DistilBERT outperforms Naive Bayes significantly (15.5 percentage points F1 score),
again demonstrating the superiority of pre-trained language models as opposed to knowledge-agnostic
classification models.
Model | Input | Prec | Rec | F1
Naive Bayes | Sentence | 79.13 | 79.30 | 79.20
 | Phrase | 87.30 | 87.19 | 87.23
 | Sent + phrase | 84.63 | 83.95 | 84.14
DistilBERT | Sentence | 92.80 | 92.82 | 92.79
 | Phrase | 98.48 | 98.47 | 98.47
 | Sent + phrase | 98.01 | 98.00 | 97.99

Table 7. Results for Task 2, Setting B: Classifying types of discourse references, two classes: “Actor explicitly mentioned” vs. “No actor mentioned.”
Table 7 displays the results for Task 2, Setting B in which we exclude the least frequently
assigned category
metonymy and only distinguish between the instances of the two classes
Actor explicitly mentioned and
No actor mentioned. We find that the Naive Bayes model significantly improves when excluding the small
class
metonymy (6 percentage points F1 score when provided with the marked phrase), while DistilBERT
improves only by 1.5 percentage points compared to setting A (F1 score when provided
with the marked phrase). Again, we find that providing both models with relevant phrases
instead of complete sentences improves the model’s performance – which especially
applies to Naive Bayes.
Model | Input | Prec | Rec | F1
Naive Bayes | Sentence | 80.73 | 80.73 | 80.73
 | Phrase | 80.93 | 81.71 | 81.14
 | Sent + phrase | 82.84 | 81.88 | 82.25
DistilBERT | Sentence | 92.56 | 92.55 | 92.50
 | Phrase | 97.83 | 97.82 | 97.82
 | Sent + phrase | 97.37 | 97.37 | 97.36

Table 8. Results for Task 2, Setting C: Classifying types of discourse references, two classes: “Actor explicitly mentioned + Metonymic mention of the actor” vs. “No actor mentioned.”
In
Table 8 we finally display the results for Task 2, Setting C, in which we subsume the category
Actor explicitly mentioned with the category
Metonymic mention of the actor under the main category
actor-naming references and binarily distinguish between the categories
actor-naming references and
No actor mentioned. While the results for DistilBERT stay almost the same as in Setting B, we find that
the performance of Naive Bayes drops drastically (-5 percentage points, F1 score when
provided with a phrase as input). This indicates that the model struggles with the
category
Metonymic mention of the actor – even when this category is subsumed under one label with another category.
To summarize, our results show that both models are able to learn to detect and classify
discourse references in political debates. The trained knowledge-rich model DistilBERT
outperforms the knowledge-agnostic model Naive Bayes significantly on all tasks and
settings. We furthermore find that providing the models with relevant phrases instead
of or in addition to complete sentences improves the model’s performance, which indicates
that the models can benefit from being explicitly hinted at the parts of the sentences
that are relevant for classifying different types of discourse references. It furthermore
shows that those parts of the sentences which are not relevant for distinguishing
between different types of discourse references are not only useless for the classification,
but even lower the model’s performance.
5. Analysis of results
In this section, we present a deeper analysis of the predictions, performance, and
errors of our best-performing model DistilBERT.
Figure 1 displays the error matrix for Task 1 where DistilBERT achieves a performance
of 93.16 F1 score (cf.
Table 5). We find that the cases where the model predicts a discourse reference but according
to the gold data the respective instance contains no discourse referencing (false
positives, n=37) and vice versa (false negatives, n=29) are almost balanced.
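The error matrix underlying this analysis can be reproduced from gold and predicted labels as in the following sketch (hypothetical toy labels; the off-diagonal cells give the false positives and false negatives):

```python
from sklearn.metrics import confusion_matrix

gold = [0, 1, 1, 0, 1, 0]        # 1 = discourse referencing
predicted = [0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(gold, predicted, labels=[0, 1]).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
```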
While the manual analysis of the 29 false negatives did not lead to any observation
of linguistic patterns which might lead the model to wrong predictions, the analysis
of the 37 false positives showed that in many cases, DistilBERT predicts a discourse
reference for those instances that mention an actor, but not in a discourse referencing
function such as in example 8 and 9:
8. When it comes to religious constitutional law and legal history at this late hour,
I can understand that Mr. von Notz is not the only one who cannot wait to enter this
debate.
[Wenn es zu vorgerückter Stunde um Religionsverfassungsrecht und Rechtsgeschichte
geht kann ich verstehen dass Herr von Notz nicht der Einzige ist der es gar nicht
abwarten kann in diese Debatte einzutreten.]
9. The Highway GmbH of the federal state examines the facts of the case.
[Die Autobahn GmbH des Bundes prüft den Sachverhalt.]
In both examples, actors are named (Herr von Notz; Die Autobahn GmbH des Bundes), which leads to the assumption that the model interprets explicit mentions of actors
as indicators for discourse referencing.
Figures 2–4 display the error matrices for the different settings of Task 2. Since
the performance of DistilBERT on Task 2 is very high in all three settings, we find
only very few errors. A systematic manual analysis of the misclassified instances
revealed three main sources of errors:
Error type 1: The model confuses the labels actor and metonymy (Setting A)
One common error in setting A is that the category
Actor explicitly mentioned and the category
Metonymic mention of the actor are confused by the model. (
10) displays an example that mentions an actor only metonymically according to the annotation guidelines but is misclassified by DistilBERT (with all
three input options) as an instance that
explicitly mentions the actor.
10. Our Basic Law also protects freedom of occupation in Article 12.
[Unser Grundgesetz schützt in Artikel 12 auch die Berufsfreiheit.]
A reason for this type of error may be the small size of the class Metonymic mention of the actor, as it accounts for only 3.05% of the annotations in the gold standard. In the following
discussion, we will also reflect on the distinctiveness of these two classes. This
error type confirms our choice of Setting B and C, where metonymy is either excluded
(B) or subsumed together with the frequent category Actor explicitly mentioned under the main category actor-naming references (C).
Error type 2: An actor was predicted when there was none
Similar to the first type of error, the model misclassifies instances as belonging
to the class of actors being explicitly mentioned, whereas according to the gold standard,
no actor is mentioned. An explanation may be the mentioning of actors that are not
part of the discourse reference made (e.g.,
“The Bundestag” and
“US President Donald Trump” in (
11)), or the use of pronouns (
we in (
12)). This assumption is reinforced by the fact that this error mostly occurs when sentences
form the input. When providing the model with a phrase (underlined in the examples),
which usually does not contain a named entity/pronoun, the model makes correct predictions.
11. The Bundestag would be well advised to take this admonition to heart, also in order not to run the risk of being identified with a policy of
racist claims of superiority against China, such as that put forward by former US President Donald Trump.
[Der Bundestag wäre gut beraten, sich diese Mahnung zu Herzen zu nehmen, auch, um nicht Gefahr zu laufen, mit einer Politik des rassistischen
Überlegenheitsanspruchs gegenüber China, wie sie der vormalige US - Präsident Donald Trump nach vorne stellte,
identifiziert zu werden.]
12. Unfortunately, we are increasingly seeing negative aspects of our digital world with
disinformation and hate speech.
[Leider sehen wir mit Desinformation und Hassrede vermehrt auch negative Aspekte unserer digitalen Welt.]
Error type 3: Metonymy is only recognized when the model is provided with a phrase
Lastly, we also find several cases where the model only predicts the category
Metonymic mention of the actor correctly when provided with a phrase instead of the complete sentence; an example
is given in (
13). This error again emphasizes the importance of pointing the model to specific phrases
for the detection and classification of discourse references, by providing it only
with the phrase that has been marked manually as relevant for discourse referencing
as input, as described above.
13. On average, women do 1.5 hours more work a day in the household and raising children
than their partners - that’s what previous surveys tell us - and in return, they can work fewer hours.
[Frauen leisten im Schnitt täglich 1,5 Stunden mehr Arbeit im Haushalt und bei der
Kindererziehung als ihre Partner – das sagen uns die bisherigen Erhebungen – und im Gegenzug können sie weniger arbeiten gehen.]
6. Discussion
First of all, it should be emphasized that the results can be considered very encouraging:
The very high F1 values indicate the robustness of the category system and the high
quality and homogeneity of the annotations. Not surprisingly, the results of the machine
learning experiments show that the pre-trained BERT model outperforms the Naive Bayes
model. This can be traced back to the fact that while traditional statistical models
such as the Naive Bayes model are solely trained on the labelled training data, BERT
is pre-trained on large amounts of data and then fine-tuned on the labelled training
data, which makes it a knowledge-rich model. This aligns with the observation in various
other NLP tasks such as sentiment analysis, text classification or summarization,
where BERT (and other Large Language Models such as XLNet or GPT) usually outperform
traditional statistical models (cf. [
González-Carvajal and Garrido-Merchán 2020]).
In our experiments, especially in the rarer and more difficult category of metonymic
actor mentions, the pre-trained model BERT performs well, while this more fine-grained
distinction causes difficulties for the Naive Bayes model, which lacks pre-training. In addition to
this content-categorical granularity, both models benefit from the higher granularity
of segmentation. Phrase-accurate annotation produced better results than annotation
with sentence-only segmentation. Thus, the attempt to introduce a kind of human “attention mechanism” into annotation has proven successful.
Concerning category development, we observe how important it is to focus on the
interplay between form and meaning – between segment size and the conceptual granularity
of categories: We showed that annotation of smaller, customized segments that
precisely indicate instances of categories improves the pre-trained BERT model’s performance
in detecting even fine-grained conceptual categories. Unlike with the larger,
standardized segment “sentence,” the model could also learn differentiated category systems with high performance
based on the customized extracted phrases. The greater formal precision relieves the model
of the multi-dimensional and highly complex inferential processes involved in human
language understanding. When categorizing based on whole sentences as input, by contrast,
the model must in principle mimic the full complexity of human language comprehension.
Against this background, we review the course and results of the annotation
workflow: The lowest inter-annotator agreement value was obtained for +/- discourse
referencing, the least granular and, at first glance, simplest distinction.
Since this presumably simple binary classification task was performed at the segmental
level of sentences, the full range of linguistic, contextual, and also domain knowledge
was required for classification. Even though the indicators were described as precisely
as possible in the guidelines, the high variation of the form-function correlation
still required pragmatic judgement in most cases. Such judgement can only be made properly
on the basis of expert knowledge, which is acquired in the practice of everyday academic
life. Accordingly, uncertainties and misunderstandings arose among the student annotators
which could not be resolved by the guidelines alone, but only by training and joint practice.
Declarative factual knowledge is therefore not sufficient for such a classification
task; procedural expert knowledge, as it were, is required.
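For the binary +/- discourse referencing decision, chance-corrected agreement coefficients of the kind surveyed by [Artstein and Poesio 2008] quantify such difficulties; the short sketch below computes Cohen's kappa with scikit-learn on two hypothetical annotator label sequences.

```python
# Illustrative sketch: chance-corrected inter-annotator agreement for the binary
# +/- discourse referencing task. The two label sequences are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["+", "-", "+", "+", "-", "+", "-", "-", "+", "+"]
annotator_b = ["+", "-", "+", "-", "-", "+", "-", "+", "+", "+"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```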
7. Conclusion
With a focus on linguistic routines of discourse referencing, we conducted a collaborative
annotation on a sub-corpus of the plenary minutes of the German Bundestag in two steps:
First, we performed a binary classification task (+/- discourse referencing). Second,
we classified mentions of actors according to a three-item tagset (explicit/metonymic/none).
Additionally, we extracted the phrases that had been identified as indicating discourse referencing.
We then ran machine learning experiments with probabilistic and neural classifiers
on our annotated dataset as training data. In these experiments, we tested the effect
of different types of input data in terms of taxonomies (number of categories) and
segment sizes (phrase input vs. sentence input). Our study has shown that the pre-trained
neural transformer language model BERT achieves impressive learning results when provided
with data annotated according to our category system.
It has been demonstrated that a more fine-grained segmentation on the linguistic surface
(that is, the manual selection of relevant phrases) improves model performance.
This suggests that wherever pragmalinguistic phenomena can be operationalized through
sufficiently fine-grained indicators on the linguistic surface, high machine learnability
is achievable – probably even for categories that are more fine-grained and dependent
on contextual and background knowledge. To summarize, our results show that the recognition and categorization
of different types of discourse references can be modelled automatically with neural,
knowledge-rich models.
In plenary debates, as our studies indicate, these practices of discourse referencing
play an important role and are used frequently. We believe, however, that our methodological
findings can be generalized to other text genres as well as to other complex linguistic
categories. Reflecting on our category development process, we can conclude that both
the performance of the algorithmic models and the human inter-annotator agreement
benefited from the refinement and specification of the segmentation. A prerequisite
for this was the more precise operationalization of the phenomenon under investigation,
i.e., the elaboration of more specific indicators
on the linguistic surface that can be captured at the level of phrases. This was accompanied
by an increase in the degree of granularity of the conceptual categories. Here it
is necessary to find the right balance, depending on the object of investigation –
also with regard to the machine and human learnability of categorization. An important
part of the human learning process in the study took place in the course of the successively
more precise operationalization, explication and description in guidelines as well
as the accompanying meta-discussion among the annotators. Thus, the initially unclear
scope of interpretation depth was gradually resolved by stronger operationalization
and by explicit interpretation criteria. We consider this point to be the central
success factor and key to collaborative category development and annotation with a
view to automation.
Works Cited
Archer et al. 2008 Archer, D., J. Culpeper and M. Davies (2008) “Pragmatic Annotation,” in Corpus Linguistics: An International Handbook, pp. 613–641.
Artstein and Poesio 2008 Artstein, R. and M. Poesio (2008)
“Inter-Coder Agreement for Computational Linguistics,” Computational Linguistics, 34(4), pp. 555–596. Available at:
https://doi.org/10.1162/coli.07-034-R2.
Becker et al. 2020 Becker, M., M. Bender and M. Müller (2020)
“Classifying heuristic textual practices in academic discourse: A deep learning approach
to pragmatics,” International Journal of Corpus Linguistics, 25(4), pp. 426–460. Available at:
https://doi.org/10.1075/ijcl.19097.bec.
Becker et al. 2021 Becker, M. et al. (2021)
“CO-NNECT: A Framework for Revealing Commonsense Knowledge Paths as Explicitations
of Implicit Knowledge in Texts,” in
Proceedings of the 14th International Conference on Computational Semantics (IWCS).
IWCS 2021, Groningen, The Netherlands (online): Association for Computational Linguistics,
pp. 21–32. Available at:
https://aclanthology.org/2021.iwcs-1.3 (Accessed: 24 July 2023).
Bender 2023 Bender, M. (2023)
“Pragmalinguistische Annotation und maschinelles Lernen,” in S. Meier-Vieracker et al. (eds)
Digitale Pragmatik. Berlin, Heidelberg: Springer (Digitale Linguistik), pp. 267–286. Available at:
https://doi.org/10.1007/978-3-662-65373-9_12.
Bender and Müller 2020 Bender, M. and M. Müller (2020)
“Heuristische Textpraktiken. Eine kollaborative Annotationsstudie zum akademischen
Diskurs,” Zeitschrift für Germanistische Linguistik, 48(1), pp. 1–46. Available at:
https://doi.org/10.1515/zgl-2020-0001.
Devlin et al. 2019 Devlin, J. et al. (2019)
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in
Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers).
NAACL-HLT 2019, Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186.
Available at:
https://doi.org/10.18653/v1/N19-1423.
Evert and Hardie 2011 Evert, S. and A. Hardie (2011)
“Twenty-first century Corpus Workbench: Updating a query architecture for the new
millennium,” in
Proceedings of the Corpus Linguistics 2011 Conference.
Corpus Linguistics 2011, University of Birmingham, GBR. Available at:
https://eprints.lancs.ac.uk/id/eprint/62721/ (Accessed: 24 July 2023).
González-Carvajal and Garrido-Merchán 2020 González-Carvajal, S. and E. C. Garrido-Merchán (2020)
“Comparing BERT against traditional machine learning text classification.” Available at:
https://doi.org/10.48550/ARXIV.2005.13012.
Greve and Wentura 1997 Greve, W. and D. Wentura (1997) Wissenschaftliche Beobachtung: eine Einführung. [Scientific Observation: An Introduction]. PVU/Beltz.
Hardie 2009 Hardie, A. (2009)
“CQPweb - Combining power, flexibility and usability in a corpus analysis tool,” International Journal of Corpus Linguistics, 17. Available at:
https://doi.org/10.1075/ijcl.17.3.04har.
Jurafsky and Martin 2022 Jurafsky, D. and J. H. Martin (2022)
Speech and Language Processing. An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edn. draft. Stanford. Available at:
https://web.stanford.edu/~jurafsky/slp3/ (Accessed: 24 July 2023).
Müller 2007 Müller, M. (2007)
Geschichte - Kunst - Nation: Die sprachliche Konstituierung einer ‘deutschen’ Kunstgeschichte
aus diskursanalytischer Sicht. Berlin, New York: De Gruyter. Available at:
https://doi.org/10.1515/9783110969436.
Müller 2022a Müller, M. (2022a)
“Die Plenarprotokolle des Deutschen Bundestags auf Discourse Lab,” Korpora Deutsch als Fremdsprache, 2(1), pp. 123–127. Available at:
https://doi.org/10.48694/KORDAF-3492.
Müller 2022b Müller, M. (2022b)
“Discourse Lab – eine Forschungsplattform für die digitale Diskursanalyse,” Mitteilungen des Deutschen Germanistenverbandes, 69, pp. 152–159. Available at:
https://doi.org/10.14220/mdge.2022.69.2.152.
Müller and Stegmeier 2021 Müller, M. and J. Stegmeier (2021)
“Korpus der Plenarprotokolle des deutschen Bundestags. Legislaturperiode 1–19. CQPWeb-Edition.” Darmstadt: Discourse Lab. Available at:
https://discourselab.de/cqpweb/.
Orbach and Goldberg 2020 Orbach, E. and Goldberg, Y. (2020) “Facts2Story: Controlling Text Generation by Key Facts,” Proceedings of the 28th International Conference on Computational Linguistics, pp. 2329–2345, Barcelona, Spain (Online). International Committee on Computational
Linguistics.
Pedregosa et al. 2011 Pedregosa, F. et al. (2011) “Scikit-learn: Machine Learning in Python,” The Journal of Machine Learning Research, 12, pp. 2825–2830.
Peters et al. 2018 Peters, M. et al. (2018) “Deep Contextualized Word Representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for
Computational Linguistics, pp. 2227–2237.
Pierce 1903 Pierce, C. (1903) CP 7.677.
Sanh et al. 2019 Sanh, V. et al. (2019)
“DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” in
The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, co-located
with the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), arXiv, pp. 1–8. Available at:
https://doi.org/10.48550/ARXIV.1910.01108.
Schmidt et al. 2021 Schmidt, T., K. Dennerlein and C. Wolff (2021)
“Emotion Classification in German Plays with Transformer-based Language Models Pretrained
on Historical and Contemporary Language,” in
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural
Heritage, Social Sciences, Humanities and Literature.
LaTeCHCLfL 2021, Punta Cana, Dominican Republic (online): Association for Computational Linguistics,
pp. 67–79. Available at:
https://doi.org/10.18653/v1/2021.latechclfl-1.8.
Weisser 2018 Weisser, M. (2018)
How to Do Corpus Pragmatics on Pragmatically Annotated Data,
Studies in corpus linguistics 84. Amsterdam, Philadelphia: John Benjamins Publishing Company. Available at:
https://benjamins.com/catalog/scl.84 (Accessed: 24 July 2023).
Wolf et al. 2020 Wolf, T. et al. (2020) “Transformers: State-of-the-Art Natural Language Processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing:
System Demonstrations. Online: Association for Computational Linguistics, pp. 38–45.
Yang et al. 2020 Yang, Z. et al. (2020)
“XLNet: Generalized Autoregressive Pretraining for Language Understanding.” Available at:
https://arxiv.org/abs/1906.08237.
Yin and Chang 2020 Yin, D., T. Meng and K.-W. Chang (2020)
“SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment
Semantics,” in
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
ACL 2020, Online: Association for Computational Linguistics, pp. 3695–3706. Available at:
https://doi.org/10.18653/v1/2020.acl-main.341.