Open MA project offers

Metadata mining of large collections of historical newspapers

Project type: semester (data science)

Supervisors: Maud Ehrmann, Matteo Romanello

Context: The impresso project aims at semantically enriching 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text re-use), and at enabling the visualization and exploration of the enriched sources via a co-designed interface. The source materials come from Swiss and Luxembourgish national libraries and correspond to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.

Description:

The focus of this project is to carry out an analysis of the various types of metadata describing the newspapers in our corpus. The metadata to be analyzed fall into three main types:

  • descriptive metadata: e.g. title, dates, periodicity, etc. This information is provided by libraries and archives, where it was edited manually; it is sparse and heterogeneous, and was never checked systematically.
  • structural metadata: about the hierarchical structure of the collections, typically organized as newspaper > issue > page > article.
  • additional information: content item type (e.g. ad, illustration, table), font type, number of words, position on the page, etc.

One of the great challenges for the impresso project, as for this semester project, is the scale of the data. With a corpus consisting of several million content items (e.g. articles) and newspaper pages, it is difficult to maintain a good overview of what is available, and even more difficult to spot problematic areas (e.g. because of poor OCR quality or missing data).

Your task is to apply data analysis and visualization to the newspaper metadata in order to:

  • generate aggregated views on newspaper objects so that users can gain insights into the collections (e.g. how many articles/illustrations/ads/etc. per year and per newspaper, density of articles per page, etc.);
  • identify possible ‘holes’ in the data, i.e. where data is not available;
  • use the insights gained by analyzing the metadata to complement incomplete library metadata, e.g. by inferring a newspaper’s periodicity from its publication dates (see the sketch after this list);
  • enable the comparison of two datasets of selected articles based on their metadata.
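
As an illustration of the periodicity-inference idea, here is a minimal pandas sketch; the column names, the toy data and the gap-to-periodicity mapping are assumptions, not the actual impresso schema.

```python
# Minimal sketch: infer a newspaper's periodicity from the gaps between issue dates.
# The DataFrame layout (one row per issue, 'newspaper' and 'date' columns) is hypothetical.
import pandas as pd

issues = pd.DataFrame({
    "newspaper": ["GDL"] * 6,  # hypothetical newspaper identifier
    "date": pd.to_datetime(["1900-01-01", "1900-01-02", "1900-01-03",
                            "1900-01-04", "1900-01-05", "1900-01-06"]),
})

gaps = issues.sort_values("date").groupby("newspaper")["date"].diff().dt.days
median_gap = gaps.median()
periodicity = {1: "daily", 7: "weekly", 30: "monthly"}.get(median_gap, "irregular")
print(periodicity)  # 'daily' for this toy sample
```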

Given all metadata of the impresso newspaper collection, in this project you will:

  • gather all relevant metadata from the existing storage systems (S3 storage in JSON format, a MySQL database and possibly a Solr index); a sketch of this step follows the list;
  • produce a series of (Jupyter) notebooks containing visualizations and analyses of the metadata for different purposes and with multiple levels of granularity;
  • (optional) create an interactive dashboard for the exploration of metadata of interest (e.g. using plotly’s Dash framework);
  • create a pipeline that produces a static result dataset (e.g. a dataframe) holding all figures, which the final impresso API could query to display the information. The pipeline should be easy to re-run when new data arrives in the storage.
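
As a starting point for the first item, here is a minimal sketch of reading JSON-lines metadata from S3 with boto3 and building one aggregated view; the bucket name, key prefix and field names ('date', 'type') are assumptions about the storage layout.

```python
# Minimal sketch: collect content-item metadata from S3 (JSON-lines files) and
# aggregate counts per year and per content-item type. Bucket/prefix/fields are hypothetical.
import json
import boto3
import pandas as pd

s3 = boto3.client("s3")
records = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="impresso-metadata", Prefix="GDL/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="impresso-metadata", Key=obj["Key"])["Body"].read()
        records.extend(json.loads(line) for line in body.splitlines() if line.strip())

df = pd.DataFrame(records)
# Aggregated view: content items per year and per type (article, ad, image, ...).
counts = df.groupby([df["date"].str[:4].rename("year"), "type"]).size().unstack(fill_value=0)
print(counts.head())
```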

Source of inspiration: the work of J.-P. Moreux from the BnF (see his GitHub repository).

Skills:

Required:

  • Python, with its data analysis stack (pandas, matplotlib, plotly)

Preferred:

  • familiarity with data science (having taken the “Applied Data Analysis” course is a plus);
  • curiosity about the application of data science in a digital humanities project;
  • familiarity with the parallel computing library dask would be a great plus;
  • ability to work independently and proactively.

Historical newspaper image classification

Type of project: semester

Supervision: Maud Ehrmann, Sofia Oliveira, Frédéric Kaplan

Description:

Context – impresso project, one of whose objectives is to semantically enrich 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text re-use). The source material comes from Swiss and Luxembourgish national libraries and corresponds to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.

Objective – Historical newspapers contain iconographic material (e.g. advertisements, photographs, illustrations, maps), which can be searched according to two main approaches:

1. based on visual features, and

2. based on textual description (metadata keywords) of images.

Forthcoming work will address the problem of image search based on image similarity features (1). The objective of this semester project is instead to experiment with automatic image classification/tagging in order to allow text-based image retrieval (2), as described in the BnF exploratory project “Image Retrieval in Digital Libraries – A Multicollection Experimentation of Machine Learning techniques” (J.-P. Moreux).

Throughout the project, the student’s mission will be to understand, use and adapt some of the techniques described in https://github.com/altomator/Image_Retrieval.

Working steps:

  1. Extract iconographic material (photographs, drawings) from the impresso OCR material.

     Input: METS/ALTO files
     Output: illustration metadata files (one file per issue) on S3, in JSON-lines format.

     NB: this step might already be done, or well advanced, when the project starts.

  2. Prototype image description enrichment using pre-trained deep learning models

  • IBM Watson API general model
  • IBM Watson face recognition
  • Google Cloud Vision
  • OpenCV (using integrated pre-trained DNN models)

    For each service:

  • understand how to use it
  • understand the tagset which is used
  • think of how to store information
  • write a script to query the API, working with a small sample (see the sketch after these working steps)
  3. Effectively run the enrichment process on the whole impresso collection

  • prepare image set to work with
  • implement efficient processing by the different APIs
  • store back information adequately
  • analyse the results: what works nicely/badly
  4. If time allows: prototype image description enrichment by re-training a deep learning model (transfer learning)

  • as in Altomator’s project, the objective would be image genre classification
  • annotate a dataset
  • retrain a model with TensorFlow (Google CNN)
  • apply and evaluate
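
For step 2, here is a minimal sketch of querying one of the listed services (Google Cloud Vision) on a small local sample; the file paths are hypothetical and valid Google Cloud credentials are assumed to be configured in the environment.

```python
# Minimal sketch: tag a few illustration crops with Google Cloud Vision labels.
from google.cloud import vision

client = vision.ImageAnnotatorClient()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set

for path in ["sample/illustration_001.jpg", "sample/illustration_002.jpg"]:  # hypothetical crops
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    labels = [(label.description, round(label.score, 2)) for label in response.label_annotations]
    print(path, labels)  # these tags could then be stored back as illustration metadata
```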

Expected outputs

  • impresso image collection semantically tagged
  • better understanding of image processing difficulties/challenges on historical material

Preferred skills

  • Python, GitHub;
  • general knowledge of how to use online APIs;
  • general knowledge of computer vision;
  • general knowledge of deep learning methods;
  • curiosity for historical material.

Improving handwritten text recognition with transfer learning

Type of project : semester

Supervisors : Sofia Ares Oliveira, Frederic Kaplan

In recent years, handwritten text recognition (HTR) systems have seen drastic improvements in performance, allowing scripts from different periods and hands to be transcribed accurately. However, these systems usually show high recognition rates on particular datasets or collections but perform poorly when applied to other scripts. While several datasets with different typologies and scripts have recently been made available to the community, obtaining a good recognition rate on a new collection remains challenging. One interesting approach being investigated to solve this problem is transfer learning (TL), which aims to extract knowledge from one (or more) source tasks and apply it to a target task.

Based on recent work, the goal of the project is to leverage the several available datasets in order to integrate a transfer learning approach into the current HTR pipeline.

The main steps in the project are :

  • Understand the general structure of the current HTR pipeline;
  • Collect and prepare the existing datasets to be used in the current HTR pipeline;
  • Design and implement the transfer learning strategies;
  • Evaluate and compare the effects of the different TL strategies and datasets on performance;

At the end of the project, a functional transfer learning pipeline, together with a careful evaluation of the implemented methods, should be delivered, in order to determine which strategies to use in the context of massive transcription of heterogeneous historical documents.
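
One possible transfer-learning strategy (freeze the convolutional feature extractor trained on the source collections and fine-tune the recurrent part on the target collection) could look like the following sketch; the CRNN architecture, layer names, shapes and the commented weight path are assumptions, not the actual impresso HTR model, and in practice the pipeline’s CTC loss would replace the placeholder loss.

```python
# Minimal sketch of layer freezing for transfer learning in a generic CRNN text recognizer.
import tensorflow as tf

def build_crnn(num_chars, height=32, width=256):
    inp = tf.keras.Input(shape=(height, width, 1))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu", name="conv1")(inp)
    x = tf.keras.layers.MaxPool2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu", name="conv2")(x)
    x = tf.keras.layers.MaxPool2D((2, 2))(x)
    x = tf.keras.layers.Permute((2, 1, 3))(x)                        # width becomes the sequence axis
    x = tf.keras.layers.Reshape((width // 4, (height // 4) * 64))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
    out = tf.keras.layers.Dense(num_chars + 1, activation="softmax")(x)  # +1 for the CTC blank
    return tf.keras.Model(inp, out)

model = build_crnn(num_chars=80)
# model.load_weights("source_htr_weights.h5")  # hypothetical weights trained on the source datasets

for layer in model.layers:
    if layer.name.startswith("conv"):
        layer.trainable = False  # keep low-level visual features, adapt the sequence model

# Placeholder loss; the real pipeline would use its CTC loss here.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="categorical_crossentropy")
```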

Preferred skills :

Python (TensorFlow, OpenCV); git (GitHub/GitLab); general knowledge of machine learning and/or deep learning; general knowledge of image processing and computer vision

Bibliography :

  1. Aradillas, J. C., Murillo-Fuentes, J. J., & Olmos, P. M. (2018). Boosting Handwriting Text Recognition in Small Databases with Transfer Learning. arXiv preprint arXiv:1804.01527.
  2. Granet, A., Morin, E., Mouchère, H., Quiniou, S., & Viard-Gaudin, C. (2018). Transfer Learning for Handwriting Recognition on Historical Documents. International Conference on Pattern Recognition Applications and Methods.
  3. Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11), 2298-2304.

Cadasters & Cities History

Type of project :

Bachelor/Master

Supervision :

Sofia Oliveira, Nils Hamel, Frédéric Kaplan

Description :

This project aims to take advantage of digitized historical two-dimensional cadasters to extract 2D and 3D models of European cities, providing a view of their evolution. The goal of the project is to create and exploit an automated pipeline able to segment cadasters, producing vectorial 2D representations that are then extrapolated into simple 3D mesh representations.

In the project, the different aspects of the pipeline have to be understood, improved and put together to create the automated pipeline. An existing deep-learning-driven method is used to segment the cadasters and produce vectorial 2D representations of the map sheets. At this stage of the project, these 2D vector representations should already be available.

Taking advantage of the 2D vector representations, the pipeline has to implement the computation of 3D mesh models of the cadasters, using 2D object interpretation to separate buildings from ground elements. Two different approaches are considered for adding the third dimension to the model: simple extrusion and modern point-cloud-driven analysis.
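
For the simple-extrusion approach, here is a minimal sketch assuming building footprints are already available as 2D polygons from the segmentation step; the footprint coordinates and the uniform height are hypothetical.

```python
# Minimal sketch: extrude a 2D building footprint into a simple 3D mesh.
import trimesh
from shapely.geometry import Polygon

footprint = Polygon([(0, 0), (10, 0), (10, 8), (0, 8)])              # hypothetical footprint (metres)
building = trimesh.creation.extrude_polygon(footprint, height=12.0)  # assumed uniform height
building.export("building.ply")                                      # ready for a 3D viewer or 4D system
```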

At the end of the project, an automated pipeline has to be delivered that can process cadasters at scale to produce 2D and 3D representations of the cities, opening a view onto their 3D history and evolution, ready to be ingested into 4D information systems.


Preferred skills :

Computer Vision, Python or C/C++, Deep Learning & Tensorflow, 2D and 3D Models

Image credits :

Sofia Oliveira (DHLAB), Nils Hamel (DHLAB)

Deep 4D Earth

Type of project :

Bachelor/Master

Supervision :

Nils Hamel, Frédéric Kaplan

Description :

In this project, a 4D model of the Earth is considered through a dedicated information system that stores spatial and temporal information using a specific indexation framework. This allows the Earth representation to be standardized through a discretization of space and time provided by the chosen indexation. It follows that massive datasets of standardized 3D and 4D information, suitable for neural network training, can be extracted from the information system.

The goal of this project is to investigate to what extent deep learning can be used to improve the quality of the Earth 4D model. As the 4D model is composed by aggregating a large number of local models (in space and time), some areas are not covered by any information while others are densely populated. The goal is then to take advantage of the well-described space and time regions to deduce the missing information in less well-known areas.

In order to achieve this objective, three different approaches are considered, leading to three types of neural networks, each specialized in one specific task. Put together, these three types of neural networks should be able to correctly extrapolate the missing parts of the 4D model.

The first type of neural network performs 3D model super-sampling (model densification). As parts of the Earth are known with a high degree of detail, they can be used to learn super-sampling in order to bring any model to the same density, leading to a more homogeneous 4D model of the Earth. The second type of neural network is related to model completion: as parts of the 4D Earth model are missing, the known parts can be used to extrapolate the missing parts, trying to create a continuation of the available data. The last type of neural network operates along the time dimension, completing the time history of a specific area based on known and well-described epochs.
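
As one deliberately simplified reading of the completion idea, the sketch below assumes local models are voxelised into fixed-size occupancy grids, which is only one candidate representation; the network shape and loss are placeholders.

```python
# Minimal sketch: a 3D convolutional encoder-decoder trained to predict a complete
# occupancy grid from a partial one (model completion on voxelised local models).
import tensorflow as tf

def completion_net(size=32):
    inp = tf.keras.Input(shape=(size, size, size, 1))              # partial occupancy grid
    x = tf.keras.layers.Conv3D(16, 4, strides=2, padding="same", activation="relu")(inp)
    x = tf.keras.layers.Conv3D(32, 4, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv3DTranspose(16, 4, strides=2, padding="same", activation="relu")(x)
    out = tf.keras.layers.Conv3DTranspose(1, 4, strides=2, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inp, out)                                # target: the complete grid

model = completion_net()
model.compile(optimizer="adam", loss="binary_crossentropy")
```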

The final goal of this project is to determine in which ways these three types of neural networks can be combined to create a completion methodology for the Earth 4D model. As each network is specialized on a specific task, putting their results into perspective can improve both their results and their stability.


Preferred skills :

Python, C/C++, Deep learning & Frameworks (TensorFlow/PyTorch), 3D Models and Rasters

Image credits :

Nils Hamel (DHLAB)

4D Earth : Web Client

Type of project :

Bachelor/Master

Supervision :

Nils Hamel, Frédéric Kaplan

Description :

The 4D Earth project aims to provide a 4D representation of the Earth, in the three space dimensions and in time. The 4D model is stored by a C server allowing 3D/4D data injection and querying. In parallel with the server, a C graphical client is available as an application to browse and display the 4D Earth model.

The goal of this project is to implement a web-based version of the graphical client, allowing the 4D Earth model to be browsed and displayed directly in web browsers. Such a development would greatly increase the accessibility of the 4D Earth model. Two approaches are considered, and the first objective of the project is to determine which is the most suitable.

The first approach consists in translating the C graphical client into JavaScript/WebGL code, making the graphical client’s features accessible in a web browser. The goal of this approach is to end up with a working implementation of the navigation interface embedded in a web browser.

The second approach, more experimental, consists in replacing OpenGL in the C graphical client with the Vulkan rendering API, which allows pure off-screen rendering. This opens the possibility of sending rendered frames as a video stream that can be displayed in a web page. User events then have to be captured in the web page and sent back to the remote client to obtain a fully working interface.


Preferred skills :

HTML, Javascript, WebGL & Framework, C/C++, OpenGL/Vulkan, Video Streaming

Image credits :

Nils Hamel (DHLAB)

ScanVan : Spherical Camera

Type of project :

Bachelor/Master

Supervision :

Vincent Buntinx, Nils Hamel, Frédéric Kaplan

Description :

ScanVan is an FNS project aiming to theorize and build an omnidirectional spherical camera able to compute 3D models from true, full panoramic images. The goal of the project is to use the camera on board a vehicle to massively scan cities on a daily basis and produce 4D digital views. Specific algorithms are implemented on board the device, allowing dense 3D models of cities to be obtained directly from the vehicle.

In the context of computing the position and orientation of each camera capture, a spherical-specific algorithm is implemented. This algorithm is able to compute the positions and orientations of an arbitrary number of captures at a time, as long as the required matches are known. This feature allows an additional degree of freedom in the way positions and orientations are computed.

The goal of this project is to implement a procedure that, on the basis of a first estimation of the positions and orientations, is able to determine a graph of clusters of camera captures that can be considered all at once for the computation of the positions and orientations. Doing so decreases the error made during the estimation of the visual odometry, leading to better results.
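
A minimal sketch of the cluster-graph idea, assuming pairwise match counts between captures are already available from feature matching; the counts, the threshold and the use of connected components are assumptions.

```python
# Minimal sketch: group camera captures that share enough matches into clusters
# that the spherical algorithm could process jointly.
import networkx as nx

match_counts = {(0, 1): 230, (1, 2): 190, (0, 2): 45, (2, 3): 210}  # hypothetical match counts
MIN_MATCHES = 100                                                    # arbitrary threshold

g = nx.Graph()
g.add_nodes_from({i for pair in match_counts for i in pair})
g.add_edges_from(pair for pair, n in match_counts.items() if n >= MIN_MATCHES)

clusters = list(nx.connected_components(g))
print(clusters)  # each cluster: captures whose positions/orientations could be estimated at once
```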

A possible extension of this project would be to take advantage of the capture cluster graph to implement an optimization process able to distribute the visual odometry error along large portions of the trajectory. This would allow loops occurring in trajectories to be taken into account, leading to more robust visual odometry estimation.


Preferred skills :

Structure from Motion, Computer Vision, Python or C/C++

Image credits :

Marcelo Kaihara (HES-SO Valais)

Draining YouTube

Type of project :

Bachelor/Master

Supervision :

Nils Hamel, Frédéric Kaplan

Description :

This project consists in developing a methodology to automatically extract and compute 3D models from YouTube videos, taking advantage of the enormous amount of image data available on this platform. The project consists of the development and improvement of three main steps.

The first and main step consists in developing a simple methodology to compute all possible 3D models out of a selected video using modern, well-implemented structure-from-motion pipelines. The goal is to detect all image sets that can be extracted from the video and are able to produce 3D models.
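
A minimal sketch of this first step, assuming the video has already been downloaded locally; the frame rate, paths and the choice of COLMAP as the off-the-shelf structure-from-motion pipeline are assumptions.

```python
# Minimal sketch: sample frames from a video and hand them to an SfM pipeline.
import pathlib
import subprocess

video = "video.mp4"                       # hypothetical downloaded YouTube video
frames = pathlib.Path("frames")
frames.mkdir(exist_ok=True)

# Extract candidate images at 2 frames per second with ffmpeg.
subprocess.run(["ffmpeg", "-i", video, "-vf", "fps=2", str(frames / "%05d.jpg")], check=True)

# Reconstruct whatever 3D models the image set supports (here with COLMAP).
subprocess.run(["colmap", "automatic_reconstructor",
                "--image_path", str(frames),
                "--workspace_path", "reconstruction"], check=True)
```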

The next step is a filtering step that has to implement model-based criteria to accept or reject the 3D models extracted from a specific video. The goal here is to filter out all 3D results that do not correspond to an accurate representation of a 3D scene and to eliminate 3D reconstruction failures.

The final step is to implement a way of traveling along YouTube videos to gather more videos with a good probability of containing image datasets usable for 3D model computation. Several methodologies can be considered, such as YouTube search results or YouTube related-video links. The goal is to end up with an automated process that minimizes the consideration of uninteresting videos for 3D model computation.


Preferred skills :

BASH or Python Script, Structure from Motion, 3D Models

Image credits :

Nils Hamel (DHLAB)

Historical newspaper article segmentation and classification using visual and textual features

Type of project: Master

Context

The impresso project, one of whose objectives is to semantically enrich 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text re-use). The source material comes from Swiss and Luxembourgish national libraries and corresponds to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.

Problems

  • Automatic processing of these sources is greatly hindered by the sometimes low quality of legacy OCR and OLR (when present) processes: articles are incorrectly transcribed and incorrectly segmented. This has consequences on downstream text processing (e.g. a topic consisting only of badly OCRized tokens). One solution is to filter out elements before they enter the text processing pipeline, which implies recognizing specific parts of newspaper pages known to be particularly noisy, such as meteo tables, transport schedules, crosswords, etc.
  • Additionally, besides advertisement recognition, OLR does not provide any section classification (is this segment a title banner, a feuilleton, an edito, etc.?), and it would be useful to provide basic section/rubrique information.

Objective

Exploring the interplay between textual and visual features for segment recognition and classification in historical newspapers.

In particular:

  • studying and understanding the main characteristics of newspaper composition and its change over time; designing a classification tag set according to the needs (filtering out, classifying) and the capacities of the tools (visual vs. textual);
  • applying dhSegment to a set of segment types (feuilleton, title banner, meteo, funerals, ads, etc.) and studying performance per and across newspapers, and per and across time buckets;
  • possibly refining the dhSegment architecture in order to better fit newspaper material;
  • understanding which elements (i.e. segment types) are not well handled by dhSegment and could benefit from textual information;
  • integrating textual features into the classification: either topic modeling or character-based language models, both currently developed within impresso (see the sketch after this list).
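
As a sketch of what the visual/textual interplay could look like at the classification stage, the example below fuses per-segment visual scores with per-segment topic vectors by simple concatenation; both feature matrices are random placeholders (dhSegment outputs and topic-model vectors are assumed to be precomputed), and concatenation is only one possible fusion strategy.

```python
# Minimal sketch: combine visual and textual features for segment-type classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

n = 200
visual = np.random.rand(n, 8)        # placeholder for per-segment dhSegment class scores
textual = np.random.rand(n, 50)      # placeholder for per-segment topic distributions
labels = np.random.randint(0, 5, n)  # segment types: title banner, feuilleton, meteo, ...

X = np.hstack([visual, textual])
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```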

The long-term goal is to enable the automatic characterization and tracing of the evolution of newspaper layout and rubriques over time.

Overall, this work will require you to:

1. develop a processing chain and integrate it in impresso code framework;

2. train and evaluate;

3. investigate historical material;

4. collaborate with impresso people.

Required skills

  • programming language: ideally python
  • understanding of machine learning principles, including deep learning approaches
  • communication/collaboration
  • familiarity with github
  • curiosity for historical material

Setting

This master project will be carried out under the impresso umbrella and will be co-supervised by people from EPFL and the University of Zurich: Simon Clematide (UZH), Maud Ehrmann (DHLAB), Sofia Oliveira (DHLAB), Frédéric Kaplan (DHLAB).

Add-on

An important part of the impresso project is the design and development of a visualization/exploration interface. If classification results are worth sharing (they should be), they will be integrated into the interface and used by historians.

A Visual Detector for Viral News

Type of project: Semester

Supervisors: Matteo Romanello, Dario Rodighiero

Required skills: Data visualization, Interface design, Front- and back-end programming


Description: Within the context of the impresso project, this semester project aims to design an interface that allows a set of clusters representing a large text dataset to be explored. These clusters are mainly characterized by two or more passages that appear in different texts within the dataset. Such repeated passages can work as an index to access the large dataset.

Establishing the reason for and nature of this similarity is usually the task of historians, performed through a close reading of texts. But prior to that, it is essential to single out the most interesting clusters, such as those containing potential viral news.

When a dataset contains millions of text passages, this operation is very challenging. In this context, the aim of the semester project is to design and create an advanced interface to visually represent these clusters, helping historians identify viral news.

We propose to develop an interface displaying text-reuse clusters along three meaningful dimensions:

– Size: the number of passages in the cluster (min. 2, max. up to 20).

– Time: the time span covered by the cluster, as viral news usually covers a larger time span.

– Lexical overlap: the extent to which passages in the same cluster share the same set of tokens.
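
As a sketch of how these three dimensions could be computed per cluster, assuming each passage carries a date and a text field (a hypothetical layout, not the impresso text-reuse format):

```python
# Minimal sketch: compute size, time span and average lexical overlap of a cluster.
from datetime import date
from itertools import combinations

def cluster_dimensions(passages):
    dates = [p["date"] for p in passages]
    token_sets = [set(p["text"].lower().split()) for p in passages]
    # average pairwise Jaccard similarity of the token sets
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(token_sets, 2)]
    return {
        "size": len(passages),
        "time_span_days": (max(dates) - min(dates)).days,
        "lexical_overlap": sum(overlaps) / len(overlaps),
    }

cluster = [
    {"date": date(1914, 7, 29), "text": "war declared on serbia by austria"},
    {"date": date(1914, 8, 2), "text": "austria has declared war on serbia"},
]
print(cluster_dimensions(cluster))
```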

Your goal will be to study the design of such an interface with a focus on graphic design, user experience, and identification rate.

Image credits :
Dario Rodighiero (DHLAB)

Analysis of rare books at the Bibliothèque nationale de France

Type of project : Master thesis
Professor: Frédéric Kaplan
Semester of project: Spring 2019
Project summary: A master project whose corpus of study is a homogeneous set (in terms of page layout) of digitized illustrated incunabula. The student will put in place a segmentation strategy to search for typographic ornaments (headbands, tailpieces, decorated initials, title-page frames, etc.) and to identify the location of engravings on the pages. The student will then build a tool able to detect the reuse of woodcut matrices (spotting identical or near-identical images) and to extract the networks linking sets of matrices to certain productions; a minimal image-matching sketch follows below. Through this approach, the structure of the printers’ network should emerge. The internship will take place in Paris, co-supervised with the collection curators.
Contact: Prof. Frédéric Kaplan
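
A minimal sketch of the image-matching idea mentioned above (spotting reuse of the same woodcut matrix), using perceptual hashing as one possible similarity measure; the file names and the distance threshold are hypothetical.

```python
# Minimal sketch: flag pairs of engraving crops whose perceptual hashes are very close.
from itertools import combinations
from PIL import Image
import imagehash

paths = ["engraving_a.jpg", "engraving_b.jpg", "engraving_c.jpg"]  # hypothetical crops
hashes = {p: imagehash.phash(Image.open(p)) for p in paths}

for p1, p2 in combinations(paths, 2):
    if hashes[p1] - hashes[p2] <= 10:   # small Hamming distance; threshold chosen arbitrarily
        print(f"possible matrix reuse: {p1} <-> {p2}")
```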

Analysis of illuminated manuscripts at the Bibliothèque nationale de France

Type of project : Master thesis
Professor: Frédéric Kaplan
Semester of project: Spring 2019
Project summary: A master project whose corpus of study is a homogeneous set (in terms of page layout, period, etc.) of digitized illuminated manuscripts. The student will put in place a segmentation strategy focused on identifying the location of illuminations on a page. Within the corpus of illuminations, the task will then be to identify their iconographic content (e.g. an animal), in order to enable search as well as a quantitative analysis of the iconographic elements represented. The internship will take place in Paris, co-supervised with the collection curators.
Contact: Prof. Frédéric Kaplan

4D reconstruction of the Louvre

Type of project : Master thesis
Professor: Frédéric Kaplan
Semester of project: Spring 2019
Project summary: The objective of the project is to use the Louvre’s archival documents to produce an evolving, multi-scale representation of its structure across the centuries. The student will develop an approach to link the digitized documents that document the model to a geohistorical platform. One of the challenges will be the visual representation of the model’s uncertainties and the possibility of performing incremental, versioned modeling. The internship will take place in Paris, co-supervised with the head of research on the history of the Louvre.
Contact: Prof. Frédéric Kaplan

Digitization of large photographic collections in Venice

Type of project : Master thesis
Professor: Frédéric Kaplan
Semester of project: Spring 2019
Project summary: The objective of the project is the analysis of the corpus of several hundred thousand art photographs assembled within the Replica project. The challenge will be to solve, using a semi-automatic method, attribution problems concerning several thousand works by comparing the results with other large digitized databases. The project will take place in Venice at the Cini Foundation.
Contact: Prof. Frédéric Kaplan