Transcription and Document Analysis

How can we automatically “read” ancient documents?

How can we segment and transcribe audio and video content?

What kind of digital preprocessing needs to be performed to facilitate these transcription processes?

How can automatic and manual processes be combined?

How can we monitor the level of errors and the biases of algorithms in these transcription processes?

External Members

Bonnard Quentin

Publications

*

A deep learning approach to Cadastral Computing

S. Ares Oliveira; I. di Lenardo; B. Tourenc; F. Kaplan

2019-07-11. Digital Humanities Conference, Utrecht, Netherlands, July 8-12, 2019.

This article presents a fully automatic pipeline to transform the Napoleonic Cadastres into an information system. The cadastres established during the first years of the 19th century cover a large part of Europe. For many cities they give one of the first geometrical surveys, linking precise parcels with identification numbers. These identification numbers point to registers where the names of the owners are recorded. As the Napoleonic cadastres include millions of parcels, they offer a detailed snapshot of a large part of Europe’s population at the beginning of the 19th century. As many kinds of computation can be done on such a large object, we use the neologism “cadastral computing” to refer to the operations performed on such datasets. This approach is the first fully automatic pipeline to transform the Napoleonic Cadastres into an information system.

*

Text Line Detection and Transcription Alignment: A Case Study on the Statuti del Doge Tiepolo

F. Slimane; A. Mazzei; L. Tomasin; F. Kaplan

2015. Digital humanities, Sydney, Australia, June 29 - July 3, 2015.

In this paper, we propose a fully automatic system for the transcription alignment of historical documents. We introduce the ‘Statuti del Doge Tiepolo’ data, which include images as well as transcriptions from the 14th century written in Gothic script. Our transcription alignment system is based on a forced alignment technique and character Hidden Markov Models, and it is able to efficiently align complete document pages.
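As a rough illustration of forced alignment, the sketch below aligns a sequence of image frames to a known transcript with a left-to-right Viterbi pass. It is a simplification and not the system described in the paper: each character is modelled as a single state rather than a full character HMM, and the function name `forced_align` and the `frame_scores` input (per-frame log-likelihoods for each character) are assumptions made for the example.

```python
import math


def forced_align(frame_scores, transcript):
    """Align frames to a known transcript (left-to-right Viterbi).

    frame_scores[t][ch] is the log-likelihood of frame t under the model
    of character ch (hypothetical input, assumed finite).
    Returns a list of (char, start_frame, end_frame_exclusive).
    """
    T, N = len(frame_scores), len(transcript)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per character")

    NEG = -math.inf
    best = [[NEG] * N for _ in range(T)]   # best[t][i]: best score ending in char i at frame t
    back = [[0] * N for _ in range(T)]     # 1 if char i starts at frame t, else 0
    best[0][0] = frame_scores[0][transcript[0]]

    for t in range(1, T):
        for i in range(min(t + 1, N)):
            stay = best[t - 1][i]                        # remain in the same character
            advance = best[t - 1][i - 1] if i > 0 else NEG  # move on to the next character
            emit = frame_scores[t][transcript[i]]
            if advance > stay:
                best[t][i] = advance + emit
                back[t][i] = 1
            else:
                best[t][i] = stay + emit

    # Backtrace from the last frame and last character to recover boundaries.
    segments = []
    i, end = N - 1, T
    for t in range(T - 1, 0, -1):
        if back[t][i] == 1:
            segments.append((transcript[i], t, end))
            end = t
            i -= 1
    segments.append((transcript[i], 0, end))
    segments.reverse()
    return segments


if __name__ == "__main__":
    scores = [
        {"a": -0.1, "b": -3.0},
        {"a": -0.2, "b": -2.5},
        {"a": -2.0, "b": -0.3},
        {"a": -3.0, "b": -0.1},
    ]
    print(forced_align(scores, "ab"))
```

Because the transcript is fixed, the search only decides where each character begins and ends, which is what makes aligning a whole page tractable.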

*

A New Text-Independent GMM Writer Identification System Applied to Arabic Handwriting

F. Slimane; V. Märgner

2014. 14th International Conference on Frontiers in Handwriting Recognition (ICFHR 2014), Crete Island, Greece, September 1-4, 2014. p. 708-713.

This paper proposes a system for text-independent writer identification based on Arabic handwriting using only 21 features. Gaussian Mixture Models (GMMs) are used as the core of the system. GMMs provide a powerful representation of the distribution of features extracted using a fixed-length sliding window from the text lines and words of a writer. For each writer, a GMM is built and trained using word and text line images of that writer. At the recognition phase, the system returns log-likelihood scores. The GMM(s) with the highest score(s) are selected, depending on whether the score is computed at the Top-1 or Top-n level. Experiments using only word and text line images from the freely available Arabic Handwritten Text Images Database written by Multiple Writers (AHTID/MW) demonstrate good performance for the Top-1, Top-2, Top-5 and Top-10 results.
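The recognition phase described above can be sketched as follows: score the feature vectors of an unknown sample under each writer's GMM and return the n writers with the highest log-likelihood. This is a minimal sketch, not the authors' implementation. It assumes diagonal covariances, takes already-trained model parameters as input (training by EM is omitted), and uses toy 2-dimensional features in place of the 21 features mentioned in the abstract. The names `gmm_log_likelihood` and `top_n_writers` are hypothetical.

```python
import math


def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of feature vectors under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) tuples, one per component.
    """
    total = 0.0
    for x in frames:
        component_logs = []
        for weight, means, variances in gmm:
            log_p = math.log(weight)
            for xi, mu, var in zip(x, means, variances):
                log_p += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
            component_logs.append(log_p)
        # Log-sum-exp over components for numerical stability.
        m = max(component_logs)
        total += m + math.log(sum(math.exp(c - m) for c in component_logs))
    return total


def top_n_writers(frames, writer_models, n=1):
    """Return the n writer ids whose GMMs give the highest log-likelihood."""
    scores = {w: gmm_log_likelihood(frames, g) for w, g in writer_models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


if __name__ == "__main__":
    # Toy single-component models; real models would be trained per writer.
    models = {
        "writer_A": [(1.0, [0.0, 0.0], [1.0, 1.0])],
        "writer_B": [(1.0, [5.0, 5.0], [1.0, 1.0])],
        "writer_C": [(1.0, [2.0, 2.0], [1.0, 1.0])],
    }
    sample = [[0.1, -0.2], [0.3, 0.1]]
    print(top_n_writers(sample, models, n=2))
```

With n=1 the correct writer must be ranked first; with larger n a result counts as correct if the true writer appears anywhere in the returned list, which is how Top-2, Top-5 and Top-10 rates are usually read.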

*

ICFHR2014 Competition on Arabic Writer Identification Using AHTID/MW and KHATT Databases

F. Slimane; S. Awaida; A. Mezghani; M. T. Parvez; S. Kanoun et al.

2014. 14th International Conference on Frontiers in Handwriting Recognition (ICFHR 2014), Crete Island, Greece, September 1-4, 2014. p. 797-802.

This paper describes the first edition of the Arabic writer identification competition using the AHTID/MW and KHATT databases, held in the context of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR2014). The competition used the new freely available Arabic Handwritten Text Images Database written by Multiple Writers (AHTID/MW) and the Arabic handwritten text database called KHATT, presented at ICFHR2012. We proposed three tasks in this Arabic writer identification competition: the first and second are based on the word level and the text line level respectively, using the AHTID/MW database, and the third is paragraph based, using the KHATT database. We received one system for the second task, three systems for the third task and none for the first task. All systems were tested in a blind manner using a set of images kept internal. A short description of the participating groups, their systems, the experimental setup, and the observed results is presented.

*

GMM-based Handwriting Style Identification System for Historical Documents

F. Slimane; T. Schaßan; V. Märgner

2014. 6th International Conference of Soft Computing and Pattern Recognition, Tunis, Tunisia, August 11-14, 2014. p. 387-392.

In this paper, we describe a novel method for handwriting style identification. A handwriting style can be common to one or several writers. It can also represent a handwriting style used in a given historical period or for a specific kind of document. Our method is based on Gaussian Mixture Models (GMMs) using different kinds of features computed with a combined fixed-length horizontal and vertical sliding window moving over a document page. For each writing style, a GMM is built and trained using page images. At the recognition phase, the system returns log-likelihood scores. The GMM with the highest score is selected. Experiments using page images from a historical German document collection demonstrate good performance. The identification rate of the GMM-based system developed with six historical handwriting styles is 100%.
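To make the sliding-window idea concrete, the sketch below moves a fixed-length window across a binarized page along both axes and records one value per window position. The abstract does not specify which features are computed, so ink density is used here purely as a stand-in, and the function name `sliding_window_densities` and its parameters are assumptions for the example.

```python
def sliding_window_densities(page, width, step):
    """Ink density per window position along both axes of a binary page.

    page is a list of rows of 0/1 pixels (1 = ink). A window of the given
    width spans the full page in the other dimension.
    Returns (along_x, along_y): densities for windows moving horizontally
    and for windows moving vertically.
    """
    rows, cols = len(page), len(page[0])

    def density(r0, r1, c0, c1):
        ink = sum(page[r][c] for r in range(r0, r1) for c in range(c0, c1))
        return ink / ((r1 - r0) * (c1 - c0))

    # Full-height strips sliding left to right.
    along_x = [density(0, rows, c, c + width)
               for c in range(0, cols - width + 1, step)]
    # Full-width strips sliding top to bottom.
    along_y = [density(r, r + width, 0, cols)
               for r in range(0, rows - width + 1, step)]
    return along_x, along_y


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(sliding_window_densities(page, width=2, step=2))
```

In a GMM-based system, the per-window feature vectors produced this way would be the observations the style models are trained and scored on.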

*

A Network Analysis Approach of the Venetian Incanto System

Y. Rochat; M. Fournier; A. Mazzei; F. Kaplan

2014. Digital Humanities 2014, Lausanne, July 7-12, 2014.

The objective of this paper was to perform new analyses of the structure and evolution of the Incanto system. The hypothesis was that network analysis could go beyond the textual narrative, or even the cartographic representation, and potentially offer a new perspective for understanding this maritime system.