Keio University

[Special Feature: Prospects for Digital Archives] Kiyonori Nagasaki: The Current State and Future of Digital Archives—From Publication to Collaboration

Published: November 5, 2024

Writer Profile

  • Kiyonori Nagasaki

    Professor, Department of Library and Information Science, Faculty of Letters

Introduction

A quarter of a century has passed since the trend of digitizing and publishing materials, known as digital archives (hereafter DA), began to spread. With the establishment of the Japan Society for Digital Archives, a forum has been formed where practitioners and researchers can interact while exploring the future of DA, and I feel that the number of people seeking to take on this role is steadily increasing. While there are various definitions of DA, it is ultimately an endeavor centered on digitizing, publishing, and sharing various types of materials; therefore, the diversity of the people involved is as vast as the types of materials themselves. Even so, since a certain level of shared discussion is possible regarding digital technology and the legal systems surrounding it, various communities, including academic societies, are being formed and activities are being carried out based on these axes.

Looking at recent issues of the Journal of the Japan Society for Digital Archives, Akihiko Takano's "Three Values of DA*1" revisits the "Three Values: Important Roles of Digital Archives" presented in the "Japan Search Strategic Policy 2021-2025: Making Digital Archives Part of Everyday Life." These are (1) the succession and reconstruction of records and memories, (2) a common knowledge infrastructure that supports communities, and (3) the formation of new social networks. Due to space constraints, I will not go into detail, but I understand the values discussed here to have one important element: the contents of the knowledge infrastructure collaborate, and as a result, people also collaborate, leading to the formation of a better social network.

DA becomes a topic of conversation when high-profile content is released, but as a whole, the vast majority of it is inconspicuous, waiting for someone to discover it. It continues to be preserved and published for the possibility that someone, someday, will find value in it. In fact, even items that do not have great value when viewed individually may gain value by being aggregated, by being positioned within a community, or by collaborating with various other materials.

What is DA Collaboration?

The first things that come to mind regarding collaboration are portal sites that enable cross-searching of metadata, such as Japan Search and Europeana. Furthermore, in recent years, possibilities for even finer-grained collaboration have been opening up. Here, I would like to briefly introduce the background and current status of this.

Fine-grained collaboration refers to a mechanism that makes it easier to add annotations from both inside and outside at a unit smaller than a single document or item in a DA, and to make the results of such intellectual work as sustainable as possible.

Adding annotations from inside and outside requires software that makes this possible. In the past, this often meant special-purpose mechanisms prepared by companies, researchers, or developers. However, if annotations are made with custom software, both parties must install the same software in order to collaborate with an external site. Even a site that works independently will find the results of its annotation work unusable unless software from the same developer can be installed again whenever a system update becomes necessary. Furthermore, if that software is upgraded and loses compatibility with previous versions, or if its development ends, those results may be lost. The confusion caused by Adobe's discontinuation of Flash, for example, is still fresh in our memory. Paper books, the printed results of intellectual work, can be viewed at almost any time at the National Diet Library. By comparison, intellectual work in DA, aside from advanced initiatives and theoretical discussions, does not yet seem to have reached a state where practitioners in the field can feel secure about its sustainability.

As a measure to avoid such situations, many fields separate data from software and create the data in a standard, open format. For example, the data formats for Microsoft Word, Excel, and PDF are published as international standards, and a variety of software that can handle these formats is in widespread use. Standardizing and publishing data formats is an important factor in increasing sustainability, because it makes data usable in a range of software and reduces dependence on specific individuals or companies.

DA Collaboration through IIIF

For DA as a whole, the widespread use of the Web means that various kinds of content can be viewed with a single piece of software, the Web browser. Furthermore, in recent years, movements to standardize data formats in a way that better suits the form and content of DA have been spreading domestically and internationally. Of particular note is IIIF (International Image Interoperability Framework). What this standard standardizes is the ability to "specify partial positions or regions in various contents published on the Web using an internationally common data format."
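As a rough illustration of what "specifying a region in a common format" looks like, the IIIF Image API expresses a rectangular region as part of the image URL, in the order region, size, rotation, quality, and format. The following is only a minimal sketch: the server address `https://example.org/iiif` and the identifier `scroll-01/page3` are hypothetical, and the `max` size keyword assumes a server that supports Image API 3.0 (servers that only support version 2.0 use `full` instead).

```python
from urllib.parse import quote


def iiif_region_url(base, identifier, x, y, w, h,
                    size="max", rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API URL for a rectangular region given in pixels.

    URL pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    region = f"{x},{y},{w},{h}"
    # Identifiers containing "/" must be percent-encoded in the URL path.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, used only for illustration.
url = iiif_region_url("https://example.org/iiif", "scroll-01/page3", 1200, 800, 400, 400)
print(url)
```

Because the region is part of an ordinary URL, any site or viewer that understands the same pattern can request exactly the same part of the image without installing site-specific software.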

This makes it possible, for example, to specify a location where a miniature has been cut out from a medieval Western manuscript published on one site, and display the image of the corresponding miniature published on another site perfectly aligned with the cut-out location on a Web browser. With this standard, it is also possible to extract only the relevant parts of information from different sites and combine them to create new content. A typical example is the famous collection of facial expressions, which extracts faces from art works of all times and places, centered on Japanese picture scrolls, and applies annotations to each. At the time of writing, 9,675 face images have been extracted from 108 works and are published as a dataset that anyone can use for research.
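To give a sense of how an annotation can be attached to part of an image, the sketch below builds an annotation in the style of the W3C Web Annotation Data Model, which IIIF uses, pointing at a region of a canvas with an `#xywh=` fragment. It is not the data format of any particular project mentioned here: the function name, the URIs, and the comment text are all hypothetical.

```python
import json


def region_annotation(anno_id, canvas_id, x, y, w, h, comment):
    """Return a Web Annotation (as a dict) targeting a rectangle on a canvas."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "commenting",
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        # The media fragment xywh=x,y,w,h selects the rectangle on the canvas.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


# Hypothetical URIs, used only for illustration.
anno = region_annotation(
    "https://example.org/anno/1",
    "https://example.org/iiif/scroll-01/canvas/3",
    1200, 800, 400, 400,
    "Face of a court noble",
)
print(json.dumps(anno, ensure_ascii=False, indent=2))
```

Since the annotation is plain JSON that refers to the image only by URI, it can be stored and published on a different site from the image itself.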

The spread of IIIF has made it possible to freely utilize Web content from around the world, thereby expanding the potential to increase the value of Web content. In the early days, the expression "releasing from silos" was often used. Each piece of Web content was confined within its own site, and collaboration would incur high costs with no guarantee of success. It seems that the standard was devised in the search for a way to increase the value of individual content under these conditions.

Internationally, IIIF has been adopted by the libraries of many leading universities in Europe and the United States for the Web publication of rare materials, and it has also been adopted by national libraries in several countries, including France, the UK, the US, and Germany. In Japan, organizations that publish large-scale content, such as the National Diet Library and the National Institute of Japanese Literature, have adopted it, so the amount of IIIF-compliant content in Japan is quite large. Incidentally, the Keio University Media Center also uses IIIF, and it seems this was the first example among university libraries in Japan.

By publishing in compliance with IIIF, DA can increase the possibility of adding new value by freely linking content at various levels, from individual items to parts of content. For details, please refer to "Digital Archives Opened by IIIF" (Bungaku Report), published by the authors this year.

TEI for Textual Materials

While IIIF is a standard for content collaboration regardless of the field, there are also various DA-related standards that increase utility by specializing in a specific field. Here, we focus on the TEI (Text Encoding Initiative) guidelines, a data format focused on the humanities, particularly text research. This is because textual materials such as classical books and ancient documents currently make up a large portion of DA, and if we are to consider their usability and potential for collaboration, standards that primarily target such materials are useful.

Work on the TEI guidelines was started in 1987 by a group of researchers from Europe and the United States, mainly in the humanities and information science. For over 30 years since then, the guidelines have been supported by a community centered on humanities researchers. Currently, the TEI Technical Council leads revisions of the guidelines roughly once every six months.

Even within the single term "text research in the humanities," there are various research methods, and the points of focus vary accordingly. Even when looking at the same textual material, depending on the field or interest, one might be interested in external aspects such as the format of the material, the quality of the paper, or the typeface of the characters, while others might be interested in content aspects such as the text itself, proper nouns that appear, or part-of-speech information for each word. Creating a common data format in this diverse field of humanities is no easy task. Overcoming this to formulate a common format is what TEI aims for. This effort not only applies digital technology or develops DA but can also evolve into discussions regarding methodology in the humanities, making it interesting as a cross-disciplinary initiative within the humanities.

The Problem of Multilingualism

Another element that has become important in the TEI community in recent years is the issue of multilingualism. Although there are many participants from outside the English-speaking world, the TEI guidelines themselves are written in English, and related discussions are also primarily conducted in English. Some point out that these guidelines implicitly assume the handling of materials in English. As a community, they are working on internationalization and multilingualization, and translations of the descriptions of tags and other elements into seven languages, including Japanese, have already been published. However, due to the volume and expertise required for the guidelines as a whole, no comprehensive translations have been published in recent years. The TEI community itself had never held an annual conference outside of Europe or the United States until it held one in Tokyo in 2018.

In the multilingualization of TEI, it is necessary to respond in terms of both content and practicality. On the practical side, Japanese translations of frequently used guidelines and tutorials are required. On the content side, it is difficult to apply the TEI guidelines, which were formulated assuming materials in Western languages, directly to Japanese classical books and ancient documents. Solving this challenge is not easy, but if it can be overcome, it will enable cross-sectional analysis and the sharing of tools in a form compatible with many digitized textual materials in Europe and the United States, thereby contributing significantly to the utilization of research data, which is a major trend in recent scholarly information distribution.

I began working on this around 2006, and ten years later, in 2016, we were able to establish the East Asian/Japanese Special Interest Group (SIG) as the first SIG in this association to discuss a specific linguistic region. Based on discussions in this SIG, and through discussions at annual conferences, with the Technical Council, and on GitHub, five years later in 2021, rules for ruby (furigana) frequently used in Japanese were added to the TEI guidelines*2. As for the trend of multilingualization, the movement is gradually strengthening, with the establishment of the Indian Texts SIG in 2017. The movement from Japan also encouraged the movement of researchers related to India, and such matters seem to be a point where Japan can continue to contribute internationally, as a strength of Japan where the humanities developed relatively early in the non-Western world.
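For readers unfamiliar with how ruby is written in TEI, the following is a minimal sketch. It assumes the `<ruby>`, `<rb>`, and `<rt>` elements introduced in the guidelines for this purpose; the sample sentence and the function name `extract_ruby` are made up for illustration and do not come from any published edition.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: base text in <rb>, its reading (furigana) in <rt>.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0">
  <ruby><rb>和歌</rb><rt place="above">わか</rt></ruby>を詠む
</p>"""


def extract_ruby(xml_text):
    """Return (base text, ruby text) pairs found in a TEI fragment."""
    root = ET.fromstring(xml_text)
    pairs = []
    for ruby in root.iterfind(".//tei:ruby", TEI_NS):
        rb = ruby.find("tei:rb", TEI_NS)
        rt = ruby.find("tei:rt", TEI_NS)
        pairs.append(("".join(rb.itertext()), "".join(rt.itertext())))
    return pairs


print(extract_ruby(SAMPLE))
```

Keeping the base text and its reading in separate elements means that software can show, hide, or search either one without guessing where the furigana begins and ends.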

The use of the TEI guidelines in DA has only just begun in Japan, and I look forward to its future expansion. In particular, for the many ancient documents and classical books whose images are published in DA, that is, materials in classical Chinese or cursive script (kuzushiji), general viewers may well be unable to understand the meaning even if they can read the characters, or unable to read the characters at all. It is therefore desirable to be able to add text data or provide modern Japanese translations. Adding such new content to DA will also increase its value. For details regarding TEI, please refer to "Introduction to Text Data Construction for the Humanities" (Bungaku Report), published by the authors last year.

DA Collaboration through the Combination of TEI and IIIF

Regarding collaboration with images in particular, text encoded in accordance with the TEI guidelines can be linked to, and displayed alongside, any part of a IIIF-compliant image. For example, in the TEI-compliant edition of the Iwashimizu-sha Uta-awase, manuscripts published by the Cabinet Library and Gunma University can be viewed so that, while reading the text data, the reader sees the corresponding parts of the IIIF-compliant images wherever the two manuscripts differ. In other words, the DA images published by each institution are being used by researchers of waka literature as an important element of academic content, without the publishers having to make any additional effort. Checking how something is written in the original manuscript with a single click, rather than going to see it in person or searching for the relevant passage from the beginning, is admittedly far from the profound and dense experience of visiting the materials in person. Even so, it opens up various new possibilities, such as properly consulting materials from a slightly distant field with very little effort, or serving as an entry point for education in such research methods.
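One way such a link between text and image can be expressed is sketched below. In TEI, a `<zone>` inside `<facsimile>` records a rectangle on a page image, and a `facs` attribute on a piece of text points to that zone. The sketch resolves each `facs` pointer and turns the zone into a IIIF Image API region URL. It is not the implementation used by the editions mentioned above: the sample document is invented, and it assumes that `<graphic url="...">` holds the base URI of a IIIF image service and that the zone coordinates are in image pixels.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI document; the image service URL is an assumption.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface>
      <graphic url="https://example.org/iiif/ms-a-p1"/>
      <zone xml:id="z1" ulx="100" uly="200" lrx="400" lry="260"/>
    </surface>
  </facsimile>
  <text><body>
    <l facs="#z1">sample verse line</l>
  </body></text>
</TEI>"""


def region_urls(tei_xml):
    """Return (text, IIIF region URL) pairs for elements carrying @facs."""
    root = ET.fromstring(tei_xml)
    # Index every zone by its xml:id, together with its surface's image URI.
    zones = {}
    for surface in root.iterfind(".//tei:surface", NS):
        graphic = surface.find("tei:graphic", NS)
        for zone in surface.iterfind("tei:zone", NS):
            zones[zone.get(XML_ID)] = (graphic.get("url"), zone)
    result = []
    for el in root.iter():
        facs = el.get("facs")
        if facs is None:
            continue
        base, zone = zones[facs.lstrip("#")]
        # TEI stores upper-left and lower-right corners; IIIF wants x,y,w,h.
        x, y = int(zone.get("ulx")), int(zone.get("uly"))
        w, h = int(zone.get("lrx")) - x, int(zone.get("lry")) - y
        url = f"{base}/{x},{y},{w},{h}/max/0/default.jpg"
        result.append(("".join(el.itertext()), url))
    return result


for text, url in region_urls(SAMPLE):
    print(text, "->", url)
```

The point of the design is that the text file only stores pointers and coordinates. The images stay on the publishing institution's server, which is why the publishers do not need to do any additional work.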

An example of publishing DA images with not only the original classical text but also modern Japanese and English translations is the "Juban Mushi-awase Emaki" (The Ten-Round Insect Poetry Contest Scroll) released this March. This also links TEI-compliant text to IIIF-compliant images, where the illustrations in the scroll corresponding to the waka are displayed, and furthermore, clicking on any of the three texts displays and highlights the corresponding part. From a content perspective as well as a technical one, modern Japanese and English translations connect DA content to people who cannot read classical Japanese but can read modern Japanese, or to people who can read English. Technical collaboration also contributes to connecting people. People connected through this content may contribute to this field in some way in the future. If that happens, a virtuous cycle will be formed in which the technical and content aspects mutually enhance each other.

In this way, DA created with standard data formats can serve as a core that supports collaboration technically, contextually, and personally. In future DA construction and operation, further promoting this direction will form a strong and rich foundation that encourages better knowledge sharing and supports the formation of social networks.

*Affiliations and titles are as of the time of publication.