Mark Hedges is Deputy Director of the Centre for e-Research at King's College London, and before that of the UK Arts and Humanities Data Service. He is active in a number of research and development projects related to e-research infrastructures, digital repositories, digital libraries and virtual research environments, especially but not exclusively in the humanities. Prior to becoming involved in e-research, he worked for 17 years in the software industry, taking the technical lead on a number of large-scale development projects for industrial and commercial clients. His academic background is in mathematics and philosophy and, more recently, in Byzantine studies.
This is the source
The term grid-enabling
is sometimes (or even often) used without a
clear idea of what is meant. In this article we attempt to clarify
some of the possible meanings of grid-enabling data resources. In
particular, we examine how researchers in the humanities may benefit
from using such approaches, and examine some concrete case studies in
which grid technologies have been used to support data-driven research
in the humanities.
Clarifying some of the possible meanings of grid-enabling
with case studies from data-driven research in the humanities.
Like many vogue expressions of technological origin, grid-enabling
is a term that is sometimes (or even often) used without a clear idea of what is meant. While it may seem inappropriately techno-centric to take as point of departure a technology rather than the issues that one might address with the technology, the uses and abuses of this expression may justify this inversion as a way of clearing the ground of confusion. To begin with, then, let us consider for a moment what we mean when we talk about grid-enabling data resources, before we move on to the more specific topic of humanities data, and provide some concrete case studies.
Grid computing has been defined in a variety of ways, encompassing a variety of
types of distributed computer systems.utility
or cloud
computing
Irrespective of these definitions or metaphors, when people speak of grids they
frequently mean computer systems that have in common the fact that they are powered by one or more of a particular group of technologies, which fall under the category of grid middleware.
In this context, to grid-enable data resources means to make them available via a grid, or by using some form of grid middleware. But this technical answer doesn’t help us much in understanding what we mean. Unlike technologists, researchers do not in general make the effort to use such technologies for their own sake. We need rather to look at the question of what we want to achieve by making resources available in such a fashion, and what researchers and other users will gain thereby. So what can we enable by grid-enabling data? One basic step is to enable sharing of a data resource, widening its availability and increasing its use and effectiveness. For instance, a data resource may be held on a departmental or institutional file system, or in the deep web,
that is to say regions of the World Wide Web that for one reason or another are not accessible by search engines, and thus while in principle accessible are not discoverable using generally applied techniques. Resources of this kind can be shared by putting them on a publicly available website within the reach of web crawlers, or in an institutional repository, which can themselves be opened up to web indexing agents. Such access is a necessary precondition, though not in itself sufficient to constitute true grid-enablement.
More significant possibilities exist in expanding research capacity, through the virtualisation of data and access to data, by means of abstracted and standardised interfaces and protocols. This virtualisation may operate with respect to a number of different factors:
Virtualisation can hide irrelevant
(for whatever purpose we have in mind) differences between data resources, giving the user more seamless access to them. Distributed, autonomous and heterogeneous datasets can be federated and regarded as a single resource, enhancing the visibility of the data and multiplying the uses to which it can be put.
In order to see what grid-enabling may mean, and what it can do for the humanist, let us examine some applications of grid technologies in scientific disciplines, where they have been more widely used.
The earlier focus of grid technologies was a computational one, enabling the
distribution of very large computational tasks and simulations in the big
sciences, such as physics or the environmental sciences. The grid-enabling of data in these disciplines was concerned to a great extent with enabling fast access to and transfer of very large data sets distributed over multiple, collaborating research centres.
The capture of extremely large data sets and their availability via the Internet
has also led to a new model of carrying out scientific research. Instead of the
direct observation of natural phenomena (experimental science), the creation of
(usually) mathematical models accounting for natural phenomena (theoretical
science), or the generation of knowledge through large scale simulations
(computational science), the researcher works on, gains knowledge from and adds
value to already existing collections of digital data. We meet in particular the
model of the in silico experiment (Goble et al. 2006),
where data from multiple, distributed and heterogeneous sources is processed by
software tools to create new information and generate new discoveries, for
example, sky survey data held in virtual
observatories
These examples are apposite for the humanities. On the one hand, increasingly numerous corpora of data are being produced by research projects in the humanities, as well as by digitisation programmes at libraries, archives and museums, producing digital surrogates of valuable primary source material for humanities research. On the other hand, however, the creation of digital data and the tools to process that data need to advance hand in hand. Although many data sets of value to the humanities have been produced, a situation in itself of value to researchers who (for example) no longer need to travel to a particular archive and search through documents by hand, the tools (whether grid-based or not) to carry out new modes of research using that data have lagged behind.
scientific
techniques such as LIDAR and high-resolution 3D laser scanning, such as archaeology or the study of material culture, generate very large datasets (see http://ads.ahds.ac.uk/project/bigdata/final_report/bigdata_final_report_1.3.pdf); modern archives are increasingly born digital,
forming rapidly expanding resources of unprecedented detail for future historical research.
To illustrate the particular challenges raised by data in the humanities let us
consider a specific example. The LaQuAT (Linking and Querying Ancient Texts)
project
These resources were just three examples from a larger pool of related datasets that might have been included in a larger scale project to support the classics researcher; they were selected in order to investigate the feasibility of the approach and to identify issues that might arise. Considering both these three datasets and the researcher’s broader data environment, we may make the following observations:
published
on a website but not in a way that makes the resources particularly usable by a researcher.
We would not argue that such issues arise with respect to data only in the humanities, nor that all humanities data can be characterised in this way. These issues will however be recognised by a significant number of researchers in humanities disciplines.
While the LaQuAT project demonstrated the feasibility of such an approach, we
would not claim that integrating these three data resources in itself
facilitated useful research in the classics. However, these resources were just
three examples; there are many small, scattered yet related resources that would
be much more useful to researchers if they were linked along these lines. Their
utility would increase greatly once a certain critical mass is reached, and
together they would form a whole much greater than the sum of the parts virtual data centre.
It is however difficult to address the generation of meaningful links between these data resources in a purely automated fashion. The fuzzy, uncertain and interpretative nature of the data complicates the semantics of integration, and makes it difficult to describe the semantics of the relationships between data sets. For example, it was not always clear whether similarly named columns in independent databases really represented the same sort of information and could validly be linked. In some cases there were deeper semantic issues, for example when two independent data sets contained contradictory information. While automated linking of data proved problematic and uncertain, it would be useful to describe, rationalise, or quantify this uncertainty. Often this is a matter of judgement for the researcher, based on other evidence both internal and external to the resource, and in such a situation researchers will want to define for themselves how resources relate to one another. The results of one query may, taken together with other information available to a researcher, influence the questions that are asked of others, and a more natural view would be a complex workflow with the researcher at the centre.
One of the issues addressed in the previous case study was structural and semantic diversity. While there are a variety of standardisation activities with the aim of increasing interoperability between digital resources and enabling them to be used in combination, standardisation alone is unlikely to solve all problems related to linking up data. Humanists still have to deal with legacy data in diverse and often obsolete formats, and even when standards are used the sheer variety of data and research means that there is a great deal of flexibility in how the standards are applied. Moreover, standards are generally developed within particular disciplines or domains, whereas research is often inter-disciplinary, making use of varied materials, and incorporating data conforming to different standards. There will inevitably be diversity of representation when information is gathered together from different domains and for different purposes, and consequently there will always be a need to integrate this diversity.
Let us look now at technologies with the potential for managing data diversity.
The outputs of research or digitisation programmes are increasingly being held
in formally managed digital repositories, based on particular digital repository
software, the most popular examples of which are EPrints, DSpace and
Fedora.
In their earlier incarnations, repositories were used to manage relatively simple content, primarily research articles, sometimes less formal material such as presentations. However, digital repositories have been changing, both in the type of content that they hold, and the ways in which they are used. Repository software has become more sophisticated, allowing complex digital content to be stored in such a way that its internal structure and external context can be explicitly represented, managed, described and exposed. In particular, they are beginning to be used to manage research data in a variety of disciplines. This approach is useful for research in the humanities, and for the digitised resources, such as archives, that often form the primary sources of such research. It enables researchers, in collaboration with information scientists, to put a degree of formal order into our digital objects while not restricting their essential variation.
Digital repositories are likely to be a key component of any future
e-infrastructure for the humanities, or indeed for any other discipline
Whatever the outcome, given the increasingly cross-institutional and cross-disciplinary nature of much research, it is likely to involve material across repositories that are distributed and independently managed. The work undertaken by the grid community on the integration of structured information may be able to help us interact with these highly structured repositories and the complex data they contain, which will thus become virtualised data resources on a grid, like the storage devices and databases considered above. In particular such an approach may be used to hide the idiosyncratic heterogeneity of digital objects whose degree of standardisation will only decrease as their complexity grows, allowing data resources within repositories to be discovered, integrated and delivered in a form appropriate to the work to be done.
The LaQuAT project concluded that there was a need for the human element in
identifying connections between data resources. This led us to investigate the
potential for a more interactive annotation environment that would allow
researchers to create such connections and associate with them additional
information, such as confidence levels or details of the evidence on which a
decision was made. This provides another example of the use of grid technologies
to facilitate the integration and re-use of humanities data, based on the
promising gCube, which was developed within the European D4Science
project
To take a simple example from LaQuAT — the symbol ? was frequently used in the
original data sources to indicate that an entry in a database table was unknown.
An annotator might fill in these unknown entries, and either determine their value fully or associate an uncertainty measure with the inserted data, as well as recording the basis on which the decision was made. Suppose, for example, that a researcher is looking for a date in the Volterra database that turns out to be unknown. In an annotation environment, the researchers might replace such an unknown entry with an actual entry of 100 BC, adding 80% as a confidence value, and providing links to the external data sources on which the decision was based. To take another example, it may be impossible to determine in automated fashion whether two occurrences of the same name, or two non-identical variants of a name, refer to the same historical person. The annotation environment would allow such connections to be expressed, tagged with additional information about confidence and provenance.
Our investigations led us to conclude that there is a need among at least some
humanities researchers for tools supporting collaborative processes that involve
access to and use of complex, diverse and geographically distributed data
resources, including both automated processing and human manipulation, in
environments where research groups (in the form of virtual
organisations
), research data and research outputs may all cross
institutional boundaries and be subject to different, autonomous management
regimes. These tools may form part of a collaborative Virtual Research
Environment (VRE) (see
It is probably still not clear what people mean by grid-enabling data; indeed it
is safe to say that they use it to mean different things, with varying degrees
of precision, in different contexts. We have seen that the term has its origin in a certain class of technologies, and that researchers’ focus when using grids in data management was at the outset a technical one, addressing fast access to very large, distributed datasets on virtualised, joined up
storage devices.
Although both the gMan and LaQuAT experiments could be described as
grid-enabling
the data in question, as each makes use of a grid technology to
link up the datasets, the aims and results of the two projects are quite
different. Thus the grid should not be viewed as a single technology that joins
things together, but rather as a range of different technologies, possessing
among themselves a family resemblance,
which can be used to join things
together in a range of different, but related, ways. Moreover, it is not the
only sort of technology that can play such an integrating role; Web technologies
are a more mainstream example, particularly the emerging Linked Data
technologies.
Whatever combination of technologies is used, however, data in the humanities cannot be addressed in isolation from the researchers who create and use the data, and who also, of course, interpret the results. Research in the humanities can be highly interpretative, and will continue to be so even when supported by technological or scientific methods. It is necessary to link up not only data, but also services and researchers — in the plural. Research in the humanities need no longer be an activity carried out by a single scholar, but rather by collaborating researchers interacting within an extended network of data resources, digital repositories and libraries, tools and services, and other researchers, a shared environment that facilitates and sustains collaborative scholarly processes.
Of greatest interest to humanities researchers, whose data resources may be more
notable for their complexity, diversity and fuzziness than for their size, is
grid enabling
in the sense of joining and modelling disparate information. Arguably this contrasts with the emphasis of the grid’s origins in big science,
where the principal challenges involved processing queries and distributing very large datasets. By equipping humanities data to be interoperable using grid technologies, we can aggregate the answers it provides to research queries, and thus strengthen existing research practice and methods. In the examples discussed above, grid-enabling
amounts to integrating or mediating between autonomous resources with disparities in structure, form or semantics, thereby enabling research driven by that broader corpus of data that would not otherwise have been possible. Through the creation of Virtual Research Environments, grid technologies can also be used to create that allow humanities researchers to manipulate and combine datasets from scattered sources.