Transcription and Document Analysis

How can we automatically “read” ancient documents?

How can we segment and transcribe audio and video content?

What kind of digital preprocessing needs to be performed to facilitate these transcription processes?

How can automatic and manual processes be combined?

How can we monitor the level of errors and the biases of algorithms in these transcription processes?

External Members

Bonnard Quentin

 

Publications

Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study

G. Muehlberger; L. Seaward; M. Terras; S. Ares Oliveira; V. Bosch et al. 

Purpose – This paper provides an overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020-funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point in the advanced use of digitised heritage content. The paper aims to discuss these issues. Design/methodology/approach – This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material. Findings – Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified. Research limitations/implications – The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc. Practical implications – Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing, and it represents the current state of the art in this field. Social implications – The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals. Originality/value – This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing the current application of handwriting technology in the cultural heritage sector.

Journal Of Documentation

2019-09-09

Vol. 75, num. 5, p. 954-976.

DOI : 10.1108/JD-07-2018-0114

A deep learning approach to Cadastral Computing

S. Ares Oliveira; I. di Lenardo; B. Tourenc; F. Kaplan 

This article presents a fully automatic pipeline to transform the Napoleonic Cadastres into an information system. The cadastres established during the first years of the 19th century cover a large part of Europe. For many cities they give one of the first geometrical surveys, linking precise parcels with identification numbers. These identification numbers point to registers that record the names of the proprietors. As the Napoleonic cadastres include millions of parcels, they offer a detailed snapshot of a large part of Europe’s population at the beginning of the 19th century. As many kinds of computation can be done on such a large object, we use the neologism “cadastral computing” to refer to the operations performed on such datasets. This approach is the first fully automatic pipeline to transform the Napoleonic Cadastres into an information system.

2019-07-11

Digital Humanities Conference, Utrecht, Netherlands, July 8-12, 2019.

Text Line Detection and Transcription Alignment: A Case Study on the Statuti del Doge Tiepolo

F. Slimane; A. Mazzei; L. Tomasin; F. Kaplan 

In this paper, we propose a fully automatic system for the transcription alignment of historical documents. We introduce the ‘Statuti del Doge Tiepolo’ dataset, which includes images and transcriptions of a 14th-century text written in Gothic script. Our transcription alignment system is based on a forced alignment technique and character Hidden Markov Models, and is able to efficiently align complete document pages.

2015

Digital Humanities, Sydney, Australia, June 29 – July 3, 2015.
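
The forced alignment idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' system: it assumes that a recognizer has already produced, for each image frame of a text line, a log-likelihood under the model of each character of the known transcription, and it uses one state per character instead of full character HMMs. The function name `forced_align` and the score matrix are hypothetical. A Viterbi pass then finds the best monotonic assignment of frames to characters.

```python
from typing import List


def forced_align(log_probs: List[List[float]]) -> List[int]:
    """Align frames to the characters of a known transcription.

    log_probs[t][s] is the (assumed, precomputed) log-likelihood of frame t
    under the model of the s-th transcription character. The returned list
    gives, for every frame, the index of the character it is assigned to.
    The path starts at character 0, ends at the last character, and at each
    frame either stays on the same character or advances by one.
    """
    n_frames = len(log_probs)
    n_chars = len(log_probs[0])
    if n_frames < n_chars:
        raise ValueError("need at least one frame per character")

    neg_inf = float("-inf")
    score = [[neg_inf] * n_chars for _ in range(n_frames)]
    back = [[0] * n_chars for _ in range(n_frames)]
    score[0][0] = log_probs[0][0]

    for t in range(1, n_frames):
        for s in range(n_chars):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best, prev = advance, s - 1
            else:
                best, prev = stay, s
            score[t][s] = best + log_probs[t][s]
            back[t][s] = prev

    # Backtrack from the last character at the last frame.
    path = [n_chars - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Five frames, two characters: the first three frames favour character 0.
    demo = [[-0.1, -3.0], [-0.2, -2.5], [-0.3, -2.0], [-2.5, -0.2], [-3.0, -0.1]]
    print(forced_align(demo))
```

Because the transcription is known in advance, the search only decides where each character begins and ends, which is what makes alignment cheaper than open recognition.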

A New Text-Independent GMM Writer Identification System Applied to Arabic Handwriting

F. Slimane; V. Märgner 

This paper proposes a system for text-independent writer identification based on Arabic handwriting using only 21 features. Gaussian Mixture Models (GMMs) are used as the core of the system. GMMs provide a powerful representation of the distribution of features extracted using a fixed-length sliding window from the text lines and words of a writer. For each writer, a GMM is built and trained using word and text line images of that writer. At the recognition phase, the system returns log-likelihood scores. The GMM(s) with the highest score(s) is (are) selected, depending on whether the score is computed at the Top-1 or Top-n level. Experiments using only word and text line images from the freely available Arabic Handwritten Text Images Database written by Multiple Writers (AHTID/MW) demonstrate good performance for the Top-1, Top-2, Top-5 and Top-10 results.

Proceedings of 14th International Conference on Frontiers in Handwriting Recognition

2014

14th International Conference on Frontiers in Handwriting Recognition (ICFHR 2014), Crete Island, Greece, September 1-4, 2014.

p. 708-713
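
The scoring scheme described in the abstract above (one model per writer, log-likelihood scores, Top-n selection) can be sketched as follows. This is a simplified illustration, not the published system: a single diagonal Gaussian stands in for each writer's full mixture, the feature vectors are assumed to be already extracted, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], List[float]]  # (mean, variance) per dimension


def fit_diag_gaussian(samples: List[Vector], var_floor: float = 1e-3) -> Model:
    """Fit one diagonal Gaussian (a one-component stand-in for a GMM)."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(x[d] for x in samples) / n for d in range(dim)]
    var = [
        max(sum((x[d] - mean[d]) ** 2 for x in samples) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def log_likelihood(samples: List[Vector], model: Model) -> float:
    """Sum of per-vector log-likelihoods under a diagonal Gaussian."""
    mean, var = model
    total = 0.0
    for x in samples:
        for xd, m, v in zip(x, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (xd - m) ** 2 / v)
    return total


def rank_writers(
    samples: List[Vector], models: Dict[str, Model], top_n: int = 1
) -> List[str]:
    """Return the top_n writer labels, ordered by decreasing log-likelihood."""
    scores = {writer: log_likelihood(samples, m) for writer, m in models.items()}
    return sorted(scores, key=lambda w: scores[w], reverse=True)[:top_n]
```

With `top_n=1` the function returns the single best-scoring writer; with a larger `top_n` it returns a ranked shortlist, which mirrors the Top-1 and Top-n evaluation levels named in the abstract.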

ICFHR2014 Competition on Arabic Writer Identification Using AHTID/MW and KHATT Databases

F. Slimane; S. Awaida; A. Mezghani; M. T. Parvez; S. Kanoun et al. 

This paper describes the first edition of the Arabic writer identification competition using the AHTID/MW and KHATT databases, held in the context of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR2014). The competition used the new freely available Arabic Handwritten Text Images Database written by Multiple Writers (AHTID/MW) and the Arabic handwritten text database called KHATT, presented at ICFHR2012. We propose three tasks in this Arabic writer identification competition: the first and second are based respectively on the word and text line level using the AHTID/MW database, and the third is paragraph based using the KHATT database. We received one system for the second task, three systems for the third task and none for the first task. All systems are tested in a blind manner using a set of images kept internal. The paper presents a short description of the participating groups, their systems, the experimental setup, and the observed results.

Proceedings of 14th International Conference on Frontiers in Handwriting Recognition

2014

14th International Conference on Frontiers in Handwriting Recognition (ICFHR 2014), Crete Island, Greece, September 1-4, 2014.

p. 797-802

GMM-based Handwriting Style Identification System for Historical Documents

F. Slimane; T. Schaßan; V. Märgner 

In this paper, we describe a novel method for handwriting style identification. A handwriting style can be common to one or several writers. It can also represent a handwriting style used in a historical period or for a specific kind of document. Our method is based on Gaussian Mixture Models (GMMs) using different kinds of features computed with a combined fixed-length horizontal and vertical sliding window moving over a document page. For each writing style, a GMM is built and trained using page images. At the recognition phase, the system returns log-likelihood scores. The GMM with the highest score is selected. Experiments using page images from a historical German document collection demonstrate good performance. The identification rate of the GMM-based system developed with six historical handwriting styles is 100%.

Proceedings of the 6th International Conference of Soft Computing and Pattern Recognition

2014

6th International Conference of Soft Computing and Pattern Recognition, Tunis, Tunisia, August 11-14, 2014.

p. 387-392

A Network Analysis Approach of the Venetian Incanto System

Y. Rochat; M. Fournier; A. Mazzei; F. Kaplan 

The objective of this paper was to perform new analyses of the structure and evolution of the Incanto system. The hypothesis was that network analysis could go beyond the textual narrative, or even cartographic representation, and potentially offer a new perspective for understanding this maritime system.

DH 2014 book of abstracts

2014

Digital Humanities 2014, Lausanne, July 7-12, 2014.