Massive Digitisation and Long Term Data Preservation

How can we develop more efficient, cheaper, and faster digitisation techniques that make mass digitisation programmes feasible?

How can we run crowdsourced digitization campaigns?

How can we perform efficient quality controls during digitisation processes?

How can we store and compress information as it is being digitized?

How can we attach metadata information documenting digitisation processes?

How should data be stored to ensure both efficient short-term use and long-term preservation?

What kind of storage medium should be used?

How should data be encoded to ensure traceability despite successive re-encoding?

Publications

*

Navigating through 200 years of historical newspapers

Y. Rochat; M. Ehrmann; V. Buntinx; C. Bornet; F. Kaplan

2016. iPRES 2016, Bern, October 3-6, 2016.

This paper aims to describe and explain the processes behind the creation of a digital library composed of two Swiss newspapers, namely Gazette de Lausanne (1798-1998) and Journal de Genève (1826-1998), covering an almost two-century period. We developed a general purpose application giving access to this cultural heritage asset; a large variety of users (e.g. historians, journalists, linguists and the general public) can search through the content of around 4 million articles via an innovative interface. Moreover, users are offered different strategies to navigate through the collection: lexical and temporal lookup, n-gram viewer and named entities.

*

The Venice Time Machine

F. Kaplan

2015. ACM Symposium on Document Engineering, Lausanne, Switzerland, September 8-11, 2015.

The Venice Time Machine is an international scientific programme launched by EPFL and Ca’ Foscari University of Venice with the generous support of the Fondation Lombard Odier. It aims to build a multidimensional model of Venice and its evolution covering a period of more than 1000 years. The project sets out to create a large open-access database that can be used for research and education. Thanks to a partnership with the Archivio di Stato in Venice, kilometers of archives are currently being digitized, transcribed and indexed, laying the foundation for the largest database ever created on Venetian documents. The State Archives of Venice contain a massive amount of handwritten documentation in languages evolving from medieval times to the 20th century. An estimated 80 km of shelves are filled with over a thousand years of administrative documents, from birth registrations, death certificates and tax statements all the way to maps and urban planning designs. These documents are often very delicate and are occasionally in a fragile state of conservation. To complement these primary sources, the content of thousands of monographs has been indexed and made searchable.

*

Venice Time Machine : Recreating the density of the past

I. di Lenardo; F. Kaplan

2015. Digital Humanities 2015, Sydney, June 29 - July 3, 2015.

This article discusses the methodology used in the Venice Time Machine project (http://vtm.epfl.ch) to reconstruct a historical geographical information system covering the social and urban evolution of Venice over a period of 1,000 years. Given the time span considered, the project used a combination of sources and a specific approach to align heterogeneous historical evidence into a single geographic database. The project is based on a mass digitization project of one of the largest archives in Venice, the Archivio di Stato. One goal of the project is to build a kind of ‘Google map’ of the past, presenting a hypothetical reconstruction of Venice in 2D and 3D for any year starting from the origins of the city to present-day Venice.

*

Quelques réflexions préliminaires sur la Venice Time Machine

F. Kaplan

L'archive dans quinze ans; Louvain-la-Neuve: Academia, 2015. p. 161--179.

Even today, most historians are used to working in very small teams, focusing on very specific research questions. They only rarely share their notes or data, believing, rightly or wrongly, that their preparatory research underpins the originality of their future work. Grasping the scale and informational density of archives such as those of Venice should make us realise that a handful of historians working in an uncoordinated way cannot possibly cover so vast an object with any degree of systematicity. If we want to try to transform an archive of 80 kilometres covering a thousand years of history into a structured information system, we must develop a scientific programme that is collaborative, coordinated and massive. We are facing an informational entity that is too large. Only an international scientific collaboration can attempt to get to grips with it.

*

X-ray spectrometry and imaging for ancient administrative handwritten documents

F. Albertin; M. Stampanoni; E. Peccenini; Y. Hwu; F. Kaplan et al.

X-Ray Spectrometry. 2015.

DOI : 10.1002/xrs.2581.

‘Venice Time Machine’ is an international program whose objective is to transform the ‘Archivio di Stato’ – 80 km of archival records documenting every aspect of 1000 years of Venetian history – into an open-access digital information bank. Our study is part of this project: we are exploring new, faster, and safer ways to digitize manuscripts, without opening them, using X-ray tomography. A fundamental issue is the chemistry of the inks used for administrative documents: unlike pieces of high artistic or historical value, the composition of such items is scarcely documented. We used X-ray fluorescence to investigate the inks of four ordinary Italian handwritten documents from the 15th to the 17th century. The results were correlated with X-ray images acquired with different techniques. In most cases, iron detected in the ‘iron gall’ inks produces image absorption contrast suitable for tomographic reconstruction, allowing computer extraction of handwriting information from sets of projections. When absorption is too low, differential phase contrast imaging can reveal the characters from the substrate morphology.

*

Ancient administrative handwritten documents: X-ray analysis and imaging

F. Albertin; A. Astolfo; E. Peccenini; Y. Hwu; F. Kaplan et al.

Journal of Synchrotron Radiation. 2015.

DOI : 10.1107/S1600577515000314.

Handwritten characters in antique administrative documents from three centuries have been detected using different synchrotron X-ray imaging techniques. Heavy elements in ancient inks, present even in everyday administrative manuscripts as shown by X-ray fluorescence spectra, produce attenuation contrast. In most cases the image quality is good enough for tomographic reconstruction in view of future applications to virtual page-by-page ‘reading’. When attenuation is too low, differential phase contrast imaging can reveal the characters from refractive index effects. The results are potentially important for new information harvesting strategies, for example from the huge Archivio di Stato collection, which is the focus of the Venice Time Machine project.

*

X-ray Spectrometry and imaging for ancient handwritten document

F. Albertin; A. Astolfo; M. Stampanoni; E. Peccenini; Y. Hwu et al.

2014. European Conference on X-Ray Spectrometry, EXRS2014, Bologna.

We detected handwritten characters in ancient documents from several centuries with different synchrotron x-ray imaging techniques. The results were correlated with those of x-ray fluorescence analysis. In most cases, heavy elements produced high image quality suitable for tomographic reconstruction, leading to virtual page-by-page “reading”. When absorption is too low, differential phase contrast (DPC) imaging can reveal the characters from the substrate morphology. This paves the way to new strategies for information harvesting during mass digitization programs. This study is part of the Venice Time Machine project, an international research program aiming to transform the immense Venetian archival records into an open-access digital information system. The Archivio di Stato in Venice holds about 80 km of archival records documenting every aspect of 1000 years of Venetian history. A large part of these records take the form of ancient bound registers that can only be digitized through cautious manual operations. Each page must be turned manually in order to be photographed. Our project explores new ways to virtually “read” manuscripts without opening them. We specifically plan to use x-ray tomography to computer-extract page-by-page information from sets of projection images. The raw data can be obtained without opening or manipulating the manuscripts, reducing the risk of damage and speeding up the process. The present tests demonstrate that the approach is feasible. Furthermore, they show that over a very long period of time the common recipes used in Europe for inks in “normal” handwriting (ship records, notary papers, commercial transactions, demographic accounts, etc.) very often produced a high concentration of heavy or medium-heavy elements such as Fe, Hg and Ca. This opens the way in general to x-ray analysis and imaging. Furthermore, it could lead to a better understanding of the deterioration mechanisms in the search for remedies.
The most important of the results that we will present is tomographic reconstruction. We simulated books with stacks of manuscript fragments and obtained individual views from sets of projection images; these views amount to a virtual page-by-page “reading” without opening the volume.

*

Virtual X-ray Reading (VXR) of Ancient Administrative Handwritten Documents

F. Albertin; A. Astolfo; M. Stampanoni; E. Peccenini; Y. Hwu et al.

2014. Synchrotron Radiation in Art and Archaeology, SR2A 14.

The study of ancient documents is too often confined to specimens of high artistic value or to official writings. Yet, a wealth of information is often stored in administrative records such as ship records, notary papers, work contracts, tax declarations, commercial transactions or demographic accounts. One of the best examples is the Venice Time Machine project, which targets a massive digitization and information extraction program for the Venetian archives. The Archivio di Stato in Venice holds about 80 km of archival documents spanning over ten centuries and documenting every aspect of the Venetian Mediterranean Empire. If unlocked and transformed into a digital information system, this information could significantly change our understanding of European history. We are exploring new ways to facilitate and speed up this broad task, exploiting x-ray techniques, notably those based on synchrotron light. Specifically, we plan to use x-ray tomography to computer-extract page-by-page information from sets of projection images. The raw data can be obtained without opening or manipulating the bound administrative registers, reducing the risk of damage and accelerating the process. We present here positive tests of this approach. First, we systematically analyzed the ink composition of a sample of Italian handwriting spanning several centuries. Then, we performed x-ray imaging with different contrast mechanisms (absorption, scattering and refraction) using the differential phase contrast (DPC) mode of the TOMCAT beamline of the Swiss Light Source (SLS). Finally, we selected cases of high contrast to perform tomographic reconstruction and demonstrate page-by-page handwriting recognition. The experiments concerned both black inks from different centuries and red ink from the 15th century. For the majority of the specimens, we found in the ink areas heavy or medium-heavy elements such as Fe, Ca, Hg, Cu and Zn.
This eliminates a major question about our approach, since the documentation on the nature of inks for ancient administrative records is quite scarce. As a byproduct, the approach can produce valuable information on the ink-substrate interaction with the objective to understand and prevent corrosion and deterioration.