Zahra Rizvi is a Ph.D. scholar at the Department of English, Jamia Millia Islamia, Delhi, India. Her research interests include utopia/dystopia studies, digital media, young adult participatory spaces, and the ethics of care in/and play. She is a founding member of the Indian chapter of the Digital Games Research Association (DiGRA). She was recently a Ministry of Education-SPARC Fellow in Digital Humanities and Digital Cultures at Michigan State University, and is a member of the Digital Humanities Research Group at the Department of English, Jamia Millia Islamia. Her work has been published in several online and print journals.
Rohan Chauhan has been trained in Comparative Literature at the Department of Modern Indian Languages and Literary Studies, University of Delhi. He is presently studying the various interfaces between literature and history in the print cultures of colonial North India on a Junior Research Fellowship from the UGC. His interests include technologies that support textual studies in the digital age.
A. Sean Pue is associate professor of South Asian Literature and Culture at Michigan State University.
Nishat Zaidi is Professor and former Head, Department of English, Jamia Millia Islamia, New Delhi. Her publications include the monographs Makers of Indian Literature: Agha Shahid Ali (Sahitya Akademi, 2014) and Terrains of Consciousness: Multilogical Perspectives on Globalization (Würzburg University Press, 2021). Her forthcoming works include Karbala: A Historical Play (a translation of Premchand's play Karbala with a critical introduction and notes, OUP, 2022), Dreaming of the Digital Divan: Digital Apprehensions of Poetry in Indian Languages (with A. Sean Pue et al., Bloomsbury, 2022), and Ocean as Method: Thinking with the Maritime (with Dilip Menon et al., Routledge, 2022).
This paper explores multilingual minimal computing and plain text for Indian literatures. It focuses on our workflow designed to produce multilingual, annotated digital critical editions of Indian-language poetry, and to model, explicate, and visualize their poetics. In the absence of digital scholarly corpora, resources developed by citizen scholars working outside of academia are essential; for our team and audience, this includes free and open source solutions — including optical character recognition tools — developed in other contexts. Modeling formal, metrical, thematic, and rhythmic structures opens up the possibility of computer-assisted scholarly analysis across the variously related languages and literary histories of India, which are usually treated in isolation. Positioning our work as a form of minimal computing, we discuss our workflow as a form of jugaad, or innovation under constraint.
Exploring minimal computing as a method for creating multilingual, digital critical editions of Indian-language poetry.
India is a multilingual society with hundreds of years of continuous literary traditions in dozens of languages, some stretching as far back as two and a half millennia. We use India here not only to refer to the contemporary nation-state but also to the greater Indian subcontinent, taking note of the complex, overlapping territorial histories of the premodern Hind or Hindustan, the des or desh, British India, and Cold War South Asia. Additionally, we acknowledge the interconnectedness of India and its diaspora — the world's largest — with other regions. For any project on Indian literature(s), it makes more sense to speak of multiple literary traditions instead of one. When we speak of Indian poetics, we are actually referring to the longstanding traditions of poetics in what came to be understood in the modern period as distinct literary traditions in multiple languages, such as Hindi, Urdu, Bengali, Marathi, Tamil, and Telugu, among many others, which cumulatively stand for Indian literature(s). While the specificities of these literary traditions are of interest in themselves, we choose to emphasize commonalities and interactions among the various poetic traditions of modern Indian languages by reading them comparatively.
Indian literature, imagined this way, is an enormously broad conception that demands philological skills that exceed the capacity of any single individual. Therefore, for our research project, we started with Hindi and Urdu, two languages with which our larger research group, consisting of students and faculty in India and the United States, generally felt some familiarity. These languages have an entangled yet estranged relationship, and they are the two Indian languages, other than English, that our multilingual team knows the best. It is not unusual in India for a person to know two or three languages well enough to share in these literary traditions partially. Languages such as Punjabi and Bengali have some common sources of vocabulary and generally similar grammars to Hindi and Urdu but are written in different scripts with varying pronunciations. Other languages, such as Malayalam, have, in addition to script and phoneme differences, fully divergent grammars — Dravidian rather than Indo-Aryan — though they also incorporate some common sources, such as Sanskrit.
Hindi and Urdu share a common ancestry: the North Indian speech of the Delhi area. Literary histories of Urdu and Hindi name this common ancestor Hindi/Hindui/Hindavi, a name that, they note, could be Persian for "Indian." Sometime in the eighteenth century, perhaps due to the growing prestige of Persian, there arose an undue — and sometimes even almost mindless — emphasis on "correct" or "standard, sanctioned" speech in poetry and prose. It was the British bias in language policy, however, that eventually led to the separation of this North Indian speech into two separate languages defined along religious lines, a bias that cast Urdu and its literature as foreign to South Asia because of its Persian-language elements and metaphors.
Our project is a collaboration between the Department of English at Jamia Millia Islamia and Michigan State University.
In what follows, we expound on our desired outcomes, approach, and architecture. As noted above, we aim to develop digital critical editions and datasets of annotations, including poetic keywords. We approach corpora development as a form of making rooted in the principle of jugaad, the North Indian practice of innovation under constraint discussed below.
Critical editions have long been a central component of traditional humanities scholarship, primarily for framing our perception of history, literature, art, thinking, and language by establishing reliable sources for research and by authorizing and canonizing certain readings. Digital critical editions add interactivity, multimedia, hypertext, and immaterial and highly dynamic (or fluctuating) ways of representing content, which are absent from printed critical editions. They turn the edition into a laboratory where the user is invited to work with the text more actively, with the help of integrated features and tools allowing for customization, personalization, manipulation, and contribution.
Through such digital critical editions, our aim is not merely to bring the past into the future, but also to furnish it for newer modes of computational inquiry.
We draw on Raymond Williams' method of developing accounts of words as reflective essays in order to make sense of our textual corpus of Indian poetics. Our aim is not to fix (their) definitions; rather, like Williams, we hope to explore the complex uses of a variety of conceptual categories and to convey the contested nature of their meanings as clearly and succinctly as possible.
While we want to make this resource accessible to a general audience, we are also keen on designing it for use by specialists, teachers, and students in the classroom, as well as in computational or traditional research, by providing a snapshot of the historical evolution of the semantic fields associated with these terms. While historical variations in meaning might not be as evident for topics pertaining to prosody, we believe this approach might be particularly useful for pinning down both the history of words and the contestation of their meanings in the technical poetic vocabularies of different literary traditions over generations, eras, and epochs, acknowledging their discontinuities, ruptures, erasures, and reconstructions, especially in moments of political and social upheaval.
Because of the absence of existing Indian-language digital corpora, especially for literary texts, making became a necessary component of our research. Making has undoubtedly been central to digital humanities and, as recent debates on the issue have clarified, does not necessarily have to begin from scratch but can instead start in medias res. Critical making, in Matt Ratto's sense of the term, not only opens the disciplinary boundaries of digital humanities to introspection but also enriches the field by drawing upon practices unfamiliar to its predominantly Anglo-American roots.
For us, critical making also entails localization, whereby we hope to develop, reuse, and repurpose open-source tools and technologies for the particularities of Indian languages and scripts. Our conception of making is rooted in reconfiguring the materialities of existing tools to overcome obstacles and find solutions, a practice known as jugaad in North India, particularly north of the Vindhya Mountain Range, that is, across the territories associated with the Indo-Aryan languages in which the word occurs; given that the Vindhya Range essentially splits the country in half, this distribution testifies to the prevalence of the word among a large Indian populace. We place this techno-myth alongside other cultural practices of making in the Global South that Ernesto Oroza would deem technological disobedience. Yet misery is not an alternative for us, as we are motivated to engage with hybrid means to achieve our scholarly objectives.
For example, our initial efforts to localize open-source OCR tools for Indian languages take advantage of two ongoing development efforts in France and Germany. First, we focus on creating ground truth corpora to provide accurate transcriptions for training and testing both automatic text recognition (ATR) and automatic layout analysis (ALA) models for historical documents in Indian languages using eScriptorium. eScriptorium is an open-source tool for handwritten text recognition (HTR), currently under development at École Pratique des Hautes Études, Université Paris Sciences et Lettres (EPHE – PSL) as part of the Scripta project, that adapts well to bidirectional scripts. It enables the easy application of state-of-the-art neural networks for transcribing and annotating historical documents through an intuitive graphical user interface in modern web browsers.
We also draw on ongoing developments within the context of the DFG-funded OCR-D project, a coordinated initiative for developing open-source OCR solutions for historical printed documents.
Working with historical documents in different layouts and typefaces, this part of our project will create a publicly available textual corpus in Hindi and Urdu that can be further processed downstream for a host of data-driven humanistic inquiries. These include information extraction methods, such as named entity recognition. Our objective is to work primarily with publicly available digital images accessed through International Image Interoperability Framework (IIIF) API endpoints hosted at different cultural institutions, including the British Library and Internet Archive. Additionally, we will publish well-documented state-of-the-art OCR models in public repositories, along with their corresponding provenance and ground truth datasets, for use in further development of open source OCR for historical print in Hindi and Urdu.
In the absence of robust institutional support, citizen scholars around the world have done much of the work to make Hindi and Urdu texts available on the Internet. Care for languages and poetic experience has led impassioned citizen scholars to launch digital platforms such as Rekhta (Urdu), KavitaKosh (Hindi), and PunjabiKavita (Punjabi). The reach of these projects, especially Rekhta, is enormous, spanning from social media like Twitter, Facebook, and Instagram, to festivals and conferences, to classrooms, to more personal spaces such as family chat groups. Though these platforms provide enormous resources to scholars, their primary and intended audience is not the academy. Our collaboration intervenes here to provide access to these texts as data for an audience that includes the quotidian Indian as well as scholars, poets, information technologists, library systems, and nonhuman agents. Proper metadata is essential for such aims. Using linked open data, we incorporate citizen-scholar projects and resources into our projects as well.
Citizen scholarship is only one example of the changes in the portability of texts through new digital technologies. The malleability and flexibility of digital texts, coupled with their easy accessibility and reproducibility, make them increasingly desirable for scholarly inquiry. These affordances motivate our focus on the production of high-quality, accessible digital editions. While learning from community-driven initiatives by citizen scholars, our objective is to digitally remediate the critical editions of individual poets' corpora in Indian languages, using minimal markup schemes such as Markdown that are easy to reproduce for anyone willing to decode, receive, and revise them.
As we will describe below, we follow this model of using plain text in human- and machine-readable Markdown files. While the interface we use to access this data may change, the form of the data, a simple text file, is remarkably versatile. The transformation from Markdown to Text Encoding Initiative (TEI) XML, too, can be a simple and automatable procedure. However, we require additional layers of annotation, especially for poetry, as well as a multilingual interface.
Our approach is prefaced by traditions of localization and open-source activism within India, such as those of Delhi's Sarai program.
We use Git to organize, collaboratively write, and contribute to our project, as it makes both minimal dependence and minimal maintenance possible. Though Git is used primarily to version computer code, we use it both in software development and in our other collaborative efforts, including writing. It allows us to avoid dependency on research computing professionals and on expensive, high-maintenance technology. Minimal computing helps us avoid the alienation of users and the fetishization of tools. While a certain amount of code goes into any GitHub or GitLab project, Git makes entry into project development possible and accessible for beginners; our team members learned from resources such as the "Hello World" GitHub Guide and Nelson's tutorial, which offers accessible slides and cheat sheets with reduced technical language.
As noted, we turn to the adaptation and localization of existing tools for our collaborative writing workflow instead of reinventing the wheel. Specifically, we use Jupyter Book, part of the Executable Book Project, which enables users to assemble books and articles from a combination of Markdown files, executable Jupyter Notebooks, and other sources using Sphinx, the robust documentation engine developed for Python. Though the approach originated in collaborative computing in STEM fields, we find the executable notebook and book model of great interest for our digital humanities work, as it allows code to be embedded alongside text. Sharing the source files allows one team member's work to be reviewed by others, permitting open access to calculations and visualizations as well as to the techniques used to render them. By utilizing Sphinx, Jupyter Book allows these various documents to be easily combined and cross-referenced. As with Markdown, the text can be converted into HTML for website viewing, as well as into other formats.
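To give a concrete sense of how such a book is assembled: a Jupyter Book project is driven by a table of contents file. The following _toc.yml is a minimal sketch; the file names are hypothetical stand-ins for our essays and notebooks.

```yaml
# _toc.yml: a minimal, hypothetical Jupyter Book table of contents
format: jb-book
root: intro                  # intro.md, the landing page
chapters:
  - file: essays/keywords    # a Markdown essay on poetic keywords
  - file: notebooks/meter    # an executable notebook on metrical analysis
```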
In our workflow, we had to make some adjustments to attend to the specific requirements of Indian languages and of our multilingual audiences, as well as to the needs of our larger research group. In Unicode, the standard system used to digitally encode the world's writing systems, only the orthographic information about a word can be encoded. Unlike in English, there can be more than one way to write certain ligatures in both Devanagari and Nastaliq scripts, which therefore requires some normalization. Transliteration between scripts, metrical analysis, lexicography, and interfacing with information systems, however, require additional layers. For example, there is simply not enough information outside of context to distinguish between کیا read as kyā ("what") and کیا read as kiyā ("did"), since both are written identically in the Urdu script.
To address these issues, we took advantage of two newer developments: the JAMstack and the use of Git as a content management system. JAM here refers to JavaScript, APIs, and Markup. The JavaScript (J) runs both in the web browser (client-side) and as a local or remote server (server-side). In the JAMstack, an external API (A) can be accessed in JavaScript and used to populate a website with data. Individual content pages, such as blog posts, are usually written in Markdown for the Markup (M). The static pages are pre-generated through Server-Side Rendering (SSR). SSR is advantageous from a search engine optimization perspective: as the web crawlers of Internet search engines visit a page, they learn not only that there is a page at a given address but also what its contents are. Another advantage is that pages can be loaded very quickly because the pages are pre-generated and static. Finally, these websites are usually relatively small and can often be hosted for free on GitHub, GitLab, Netlify, or other web providers.
While there are several popular JAMstack frameworks, all of which could have handled the tasks at hand, we were attracted to the Vue.js JavaScript framework and, within it, the Gridsome JAMstack framework. Vue.js recommends a "single-file component" approach to web design, whereby code, template, and webpage style declarations are all kept together in one file. We decided to embrace this new approach, hoping it would help us get our projects going quickly. Gridsome uses Vue.js and adds additional features, such as a GraphQL (Graph Query Language) interface to query collections. It relies on plugins, which can import data from different sources. Commonly used sources include hidden or public blogs, Drupal websites, and, as in our case described below, a directory of plain-text Markdown files. In addition to the markup of the text, Markdown files can also contain data of nearly any sort in their header. The following example, with illustrative values, shows a sample header encoded in the YAML format, a human-readable way of storing or transmitting data:
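```yaml
---
# Illustrative front-matter values; this poem and poet recur in examples below
title: Bol
author: Faiz Ahmed Faiz
---
(body)
```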
The sample above includes a title
and an author
field. These fields, which usually
provide metadata about the text in the body
of the Markdown file, can be used to store
numbers, dates, tags, and nearly any sort of data.
The second trend that we adopted is the use of Git not only to write or code together but also as a content management system. Here, we used the open-source, JavaScript-based Netlify CMS, which is written using React, an alternative to Vue.js. Easily added to Gridsome and other frameworks, Netlify CMS allows users to authenticate with a Git repository — we used GitLab — and to make and commit changes to the repository via a web-accessible editor page. A plugin to Gridsome adds a route to a webpage where users can edit the Markdown header fields online according to a configuration that we specify; we determine what each field's content should be (e.g., dates, strings, numbers, lists of strings, or references to other nodes). Netlify CMS then handles the updates to the Git repository. As a result, people can access and update the data without installing the full JAMstack on their local computers.
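As a sketch of what such a configuration might look like, the following config.yml fragment defines a hypothetical poems collection; the repository path, folder, and field names are illustrative assumptions, not our production settings.

```yaml
# config.yml: a minimal, hypothetical Netlify CMS collection definition
backend:
  name: gitlab
  repo: our-group/our-project        # hypothetical repository path
collections:
  - name: poems
    label: Poems
    folder: content/poems            # directory of Markdown files
    create: true
    fields:
      - { label: Title, name: title, widget: string }
      - { label: Author, name: author, widget: relation,
          collection: authors, search_fields: [title], value_field: name }
      - { label: Body, name: body, widget: markdown }
```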
We also knew we wanted to have certain content and data be available in Hindi, Urdu, and
English. Fortunately, we were able to adapt the internationalization (i18n) features of
Netlify CMS to work with those of Gridsome. Netlify CMS handles translation of Markdown
header and body fields by keeping certain common fields in the default locale — we chose
English (en
) for this project — and by storing localized
(translated) fields in
other locales, as in this YAML header:
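```yaml
# A sketch of a localized header from the author collection;
# the date and the one-line bodies are illustrative.
title: Faiz Ahmed Faiz
bday: 1911-02-13
body: An influential poet of Urdu.
hi:
  title: फ़ैज़ अहमद फ़ैज़
  body: उर्दू के प्रभावशाली कवि।
ur:
  title: فیض احمد فیض
  body: اردو کے بااثر شاعر۔
```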
In this sample from an author
collection, the field bday
only appears in the default
locale (en
). The fields that can be translated (title
and body
) appear also in the
Hindi (hi
) and Urdu (ur
) locales. Note that the body
field, which followed the
header in the previous Markdown header example, is now contained as a field within the
header itself.
The Markdown file of a text can then reference its author(s) using the author
field:
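```yaml
# Hypothetical header for a poem file; the author list entry uses the
# Roman (transliterated) version of the poet's name
title: Bol
author:
  - faiz-ahmed-faiz
```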
In this example, we specify that there can be more than one author,
hence that field
contains a list, indicated in YAML by the hyphen. The author is referenced using the
Roman version of the poet's name. This allows for a fully human- and machine-readable
version of this document. In this way, the plain text files in the directory of a Git
repository are treated as a document-oriented database. (Gridsome, in fact, uses the
speedy JavaScript LokiJS database internally). When displayed, fields are presented on a
webpage in accordance with the client's chosen locale — Hindi, Urdu, or English.
Through this combination of the JAMstack and a Git repository content management system, we can provide localized access not only to viewers but also to our contributors, even if they are on a mobile phone or tablet without proper access to a computer. By using continuous integration — in our case, the running of a script when a commit is made to the Git repository — the website is automatically updated when changes to the content are made, and tests are run to assert that the changes are valid. Updates to the data can also be federated; by adjusting the Netlify CMS settings to use permission levels in Git, some users can automatically make changes while others require approval. These proposed changes can come from the public, too, offering a straightforward pathway to crowdsourcing.
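The continuous integration step itself can be quite small. The following .gitlab-ci.yml is a hedged sketch, assuming a Gridsome build and GitLab Pages hosting; the branch name and Node image are illustrative.

```yaml
# .gitlab-ci.yml: hypothetical build-and-deploy job for the Gridsome site
image: node:16
pages:
  script:
    - npm ci              # install pinned dependencies
    - npm run build       # gridsome build writes static pages to dist/
    - mv dist public      # GitLab Pages serves the public/ directory
  artifacts:
    paths:
      - public
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
```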
For web-based annotation, we are developing a custom widget that starts from the
transcription of the text in its original script (OCRed, if appropriate) stored in the
body
field of a Markdown file. A genre
field determines how exactly the text will be
treated. In general, we address the location of the individual tokens/words by their
coordinates in the Markdown file. This allows us to add multiple layers of annotation.
The transcribed words are treated as phrases
that can contain multiple words.
A
name, for example, can be a phrase, but so can a compound word. (Linguists also prefer
to have some features, such as future case markers, separated.) Sentences are stored as
a span of coordinates (e.g., for poetry indexed by line group, line, and phrase) after
we split the original text between spaces, punctuation, and paragraphs. While editing,
changes to a custom widget are mapped to a model representation of the text in the web
browser and then written to disk following updates.
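To illustrate, annotation layers of this kind could be stored alongside the text roughly as follows; the field names (sentences, phrases, span, coord, gloss) are hypothetical, not our widget's actual schema.

```yaml
# Hypothetical annotation layer addressing tokens by their coordinates:
# [line group, line, phrase] indices into the split body text
sentences:
  - span: [[1, 1, 1], [1, 2, 4]]   # a sentence spanning lines 1-2 of line group 1
phrases:
  - coord: [1, 1, 2]
    gloss: proper name             # e.g., a name treated as a single phrase
```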
We are also able to produce a view of the text in Text Encoding Initiative (TEI) XML, which is widely used in digital humanities. Sentences map to TEI's <s> element, phrases to its <phr> element, and words to its <w> element. In this way, we avoid the awkwardness of dealing with right-to-left text in XML editors. Individual words or phrases, moreover, can have additional views or links attached to them, such as scholarly or library-system transliteration or the International Phonetic Alphabet. We can also offer views of the individual sentences using the CoNLL-U format used by Universal Dependencies (UD), a framework for grammatical annotation, allowing us to take advantage of the rich set of annotation tools developed for UD.
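A minimal sketch of such a TEI view, following the mapping just described; the words are from the opening of Faiz's poem Bol, and the granularity of the markup is illustrative:

```xml
<!-- Hypothetical TEI fragment: one sentence, split into phrases and words -->
<s>
  <phr><w>bol</w></phr>
  <phr><w>ke</w> <w>lab</w> <w>āzād</w></phr>
</s>
```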
For the purposes of developing an annotated corpus, this structure also maps readily onto linked open data as subject-predicate-object triples: for example, Bol (S) hasAuthor (P) Faiz Ahmed Faiz (O). The document or node serves as the subject; the metadata field (e.g., author:) becomes the RDF predicate (e.g., hasAuthor); and the value (e.g., Faiz Ahmed Faiz) becomes the object.
In Gridsome, the entire network graph of relations is available both through
the query language GraphQL and as a database. Finally, the whole system can be accessed
and updated either on a local machine that has the JAMstack implemented or through a web
interface to Git using Netlify CMS. While the former is especially useful for those of
us doing computational analysis, the latter allows updates to be made from any phone.
The data can be easily versioned and progressively archived from Git to Zenodo or other
repositories, and the interface expands to meet our needs.
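In practice, a page component can retrieve a node and its relations with a short query. The following is a hedged sketch of a Gridsome GraphQL query, assuming a hypothetical Poem collection whose author field references the author collection shown above:

```graphql
# Hypothetical query: fetch a poem's title and its referenced author
query {
  poem(id: "bol") {
    title
    author {
      title
    }
  }
}
```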
Plain text is not only simple and dynamic in its potential but also allows authors and researchers a degree of creative and scholarly freedom over their work that proprietary tools and formats often lack. If one hopes to make the understanding of Indian poetics sustainable and accessible, such minimal computing approaches reduce the use of proprietary technologies and paywalls in order to increase access to content, data, and/or source files. They also attend to the ecological cost of storage for digital preservation and to accessibility issues for users in low-bandwidth regions.
Above we have described a collaboration to enable data-intensive textual study of Indian
languages. In the absence of existing digital corpora, especially literary corpora, we
have turned to making our own, starting with OCR workflows. To do so, we have embraced a
form of innovation under constraint by reusing or repurposing what is at hand, which is
commonly referred to in North India as jugaad.