Past Master and Semester Projects

Historical newspaper article segmentation and classification using visual and textual features (Master project – Spring 2019)

Type of project: Master

Context

This project takes place within the impresso project, one of whose objectives is to semantically enrich 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text re-use). The source material comes from Swiss and Luxembourg national libraries and corresponds to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.

Problems

  • Automatic processing of these sources is greatly hindered by the sometimes low quality of legacy OCR and OLR (when present) processes: articles are incorrectly transcribed and incorrectly segmented. This has consequences on downstream text processing (e.g. a topic consisting only of badly OCRized tokens). One solution is to filter out elements before they enter the text processing pipeline. This implies recognizing specific parts of newspaper pages known to be particularly noisy, such as weather tables, transport schedules, crosswords, etc.
  • Additionally, besides advertisement recognition, OLR does not provide any section classification (is this segment a title banner, a feuilleton, an editorial, etc.?), and it would be useful to provide basic section/rubrique information.

Objective

Exploring the interplay between textual and visual features for segment recognition and classification in historical newspapers.

In particular:

  • studying and understanding the main characteristics of newspaper composition and its change over time; designing a classification tag set according to the needs (filtering out, classifying) and the capacities of the tools (visual vs. textual);
  • applying dhSegment to a set of segment types (feuilleton, title banner, weather forecasts, obituaries, ads, etc.) and studying performance, per and across newspapers, per and across time buckets;
  • possibly refining the dhSegment architecture in order to better fit newspaper material;
  • understanding which elements (i.e. segment types) are not well handled by dhSegment and could benefit from textual information;
  • integrating textual features in the classification, using either topic modeling or character-based language models, both of which are currently being developed within impresso (see the sketch after this list).
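
As an illustration of the last point, here is a minimal sketch of how visual and textual signals could be combined for segment classification. It does not use the actual dhSegment or impresso APIs; the feature vectors, their dimensions and the label set are hypothetical placeholders, and a real implementation would work at the pixel or region level rather than on pre-computed vectors.

```python
# Hypothetical sketch: fusing visual and textual features for segment classification.
# `visual_features` could come from a dhSegment-style encoder, `textual_features`
# from topic proportions or a character-level language model; both are assumed here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 500 page segments, 64-dim visual + 32-dim textual descriptors.
visual_features = rng.normal(size=(500, 64))    # e.g. pooled CNN activations
textual_features = rng.normal(size=(500, 32))   # e.g. topic distribution of the OCR text
labels = rng.integers(0, 4, size=500)           # e.g. 0=article, 1=ad, 2=weather, 3=title banner

# Late fusion by simple concatenation of the two modalities.
X = np.hstack([visual_features, textual_features])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```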

The long-term goal is to enable the automatic characterization and tracing of the evolution of newspaper layout and rubriques over time.

Overall, this work will require the student to:

  1. develop a processing chain and integrate it into the impresso code framework;
  2. train and evaluate;
  3. investigate historical material;
  4. collaborate with impresso people.

Required skills

  • programming language: ideally python
  • understanding of machine learning principles, including deep learning approaches
  • communication/collaboration
  • familiarity with github
  • curiosity for historical material

Setting

This master project will be carried out within the impresso project and will be co-supervised by people from EPFL and the University of Zurich: Simon Clematide (UZH), Maud Ehrmann (DHLAB), Sofia Oliveira (DHLAB), Frédéric Kaplan (DHLAB).

Add-on

An important part of the impresso project is the design and development of a visualization/exploration interface. If the classification results are worth sharing (they should be), they will be integrated into the interface and used by historians.

Historical newspaper image classification (Semester – Spring 2019)

Type of project: semester

Supervision: Maud Ehrmann, Sofia Oliveira, Frédéric Kaplan

Description:

Context – This project takes place within the impresso project, one of whose objectives is to semantically enrich 200 years of newspaper archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text re-use). The source material comes from Swiss and Luxembourg national libraries and corresponds to the facsimiles and OCR outputs of ca. 200 newspaper titles in German and French.

Objective – Historical newspapers contain iconographic material (e.g. advertisements, photographs, illustrations, maps), which can be searched according to two main approaches:

  1. based on visual features, and
  2. based on textual description (metadata keywords) of images.

A forthcoming work will address the problem of image search based on image similarity features (1). The objective of this semester project is instead to experiment with automatic image classification/tagging in order to allow text-based image retrieval (2), as described in the BnF exploratory project “Image Retrieval in Digital Libraries – A Multicollection Experimentation of Machine Learning Techniques” (JP Moreux).

Throughout the project, the student’s mission will be to understand, use and adapt some of the techniques described in https://github.com/altomator/Image_Retrieval

Working steps:

  1. Extract iconographic material (photographs, drawings) from impresso OCR material.

Input: METS/ALTO files
Output: illustration metadata files (one file per issue) on S3 in JSON lines format.

NB: This step might already be done or well advanced when the project starts (a minimal extraction sketch is given below).
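
As an illustration of this first step, the following is a minimal sketch of extracting illustration regions from an ALTO file and writing them out as JSON lines. The element and attribute names follow the ALTO schema (Illustration blocks with HPOS/VPOS/WIDTH/HEIGHT), but the exact namespace version, file layout and metadata fields of the impresso data are assumptions.

```python
# Hedged sketch: pull <Illustration> blocks out of an ALTO page and dump JSON lines.
# The namespace URI and output fields are assumptions about the actual impresso data.
import json
from lxml import etree

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}  # adjust to the version in use

def extract_illustrations(alto_path, issue_id, page_no):
    tree = etree.parse(alto_path)
    records = []
    for block in tree.findall(".//alto:Illustration", namespaces=ALTO_NS):
        records.append({
            "issue": issue_id,
            "page": page_no,
            "id": block.get("ID"),
            "x": int(float(block.get("HPOS", 0))),
            "y": int(float(block.get("VPOS", 0))),
            "w": int(float(block.get("WIDTH", 0))),
            "h": int(float(block.get("HEIGHT", 0))),
        })
    return records

def write_jsonl(records, out_path):
    # One JSON object per line, matching the (assumed) S3 layout.
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Example usage (paths and identifiers are placeholders):
# write_jsonl(extract_illustrations("page-001.alto.xml", "GDL-1900-01-01-a", 1), "GDL-1900-01-01-a.jsonl")
```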

  2. Prototype image description enrichment using pre-trained deep learning models:
  • IBM Watson API general model
  • IBM Watson face recognition
  • Google Cloud Vision
  • OpenCV (using integrated pre-trained DNN models)

For each service:

  • understand how to use it
  • understand the tag set which is used
  • think about how to store the information
  • write a script to query the API, working with a small sample (see the sketch after these steps)
  3. Run the enrichment process on the whole impresso collection
  • prepare the image set to work with
  • implement efficient processing through the different APIs
  • store the resulting information adequately
  • analyse the results: what works well/badly
  4. If time allows: prototype image description enrichment by re-training a deep learning model (transfer learning)
  • as in Altomator’s project, the objective would be image genre classification
  • annotate a dataset
  • retrain a model with TensorFlow (Google CNN)
  • apply and evaluate
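
As a concrete illustration of querying one of these services on a small sample, the sketch below uses the Google Cloud Vision Python client for label detection. It assumes credentials are already configured and that images are available as local files; the file paths and output handling are placeholders.

```python
# Hedged sketch: label a small sample of newspaper images with Google Cloud Vision.
# Assumes GOOGLE_APPLICATION_CREDENTIALS is set; paths below are placeholders.
import json
from google.cloud import vision

client = vision.ImageAnnotatorClient()

sample_images = ["sample/illustration_001.jpg", "sample/illustration_002.jpg"]

with open("labels_sample.jsonl", "w", encoding="utf-8") as out:
    for path in sample_images:
        with open(path, "rb") as f:
            image = vision.Image(content=f.read())
        response = client.label_detection(image=image)
        labels = [
            {"description": l.description, "score": round(l.score, 3)}
            for l in response.label_annotations
        ]
        out.write(json.dumps({"image": path, "labels": labels}) + "\n")
```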

Expected outputs

  • impresso image collection semantically tagged
  • better understanding of image processing difficulties/challenges on historical material

Preferred skills

  • python, github;
  • general knowledge of how to use online APIs;
  • general knowledge of computer vision;
  • general knowledge of deep learning methods;
  • curiosity for historical material.

Deep reference parsing from humanities publications (Semester project – Autumn 2017)

Type of project: semester, master

Supervision: Giovanni Colavizza

Description: We have a large dataset of journal articles and books written by historians on the history of Venice. We also have a considerable amount of manually annotated references in footnotes, which were used to build a parser based on Conditional Random Fields that extracts all references. In this project you will build on this work, using the CRF models as a baseline, in order to experiment with deep learning and try to beat the current results. This project puts you in the favourable situation of having all the data you need, so you can focus entirely on the method and its improvement.
A reference is something like: “George Ostrogorsky, History of the Byzantine State, Rutgers University Press, 1986”. We need to know its components (author, title, publisher, year), its general boundaries (offsets) and its type out of a few classes (this one is a reference to a secondary source, i.e. a published piece of scholarship).
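
To make the baseline concrete, here is a minimal sketch of a token-level CRF tagger in the spirit of the existing parser, using the sklearn-crfsuite package. The feature set, the BIO label scheme and the toy training example are illustrative assumptions, not the project's actual data or configuration.

```python
# Hedged sketch: a tiny CRF reference tagger with BIO labels (illustrative only).
import sklearn_crfsuite

def token_features(tokens, i):
    # A deliberately small feature set; the real parser uses richer features.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One toy annotated reference (author / title / publisher / year).
tokens = ["George", "Ostrogorsky", ",", "History", "of", "the", "Byzantine",
          "State", ",", "Rutgers", "University", "Press", ",", "1986"]
labels = ["B-author", "I-author", "O", "B-title", "I-title", "I-title", "I-title",
          "I-title", "O", "B-publisher", "I-publisher", "I-publisher", "O", "B-year"]

X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])
```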

Profile: Computer Science, Data Science or similar

Required skills: NLP, Machine Learning, esp. RNN

Bibliography: no previous application of RNN to this task 🙂 For a general idea see:
The Unreasonable Effectiveness of Recurrent Neural Networks
Recurrent Convolutional Neural Networks for Text Classification. Lai et al. AAAI 2015.
A Convolutional Neural Network for Modelling Sentences, Kalchbrenner et al. ACL 2014.

Finding primary sources in the Archive of Venice (Spring 2018)

Type of project: semester

Supervision: Giovanni Colavizza

Description: Historians of Venice cite a large number of documents from the Archives of Venice. They usually indicate the cited sources in footnotes, using elaborate abbreviation systems. Your task is to connect every reference, which you will receive already extracted and parsed, to the unique identifier of the cited document in the information system of the Archive. You will receive ample annotated data and ground truth, and you will have to explore supervised or unsupervised approaches, assessing the trade-offs of both.
The task seems easy but is not, due to the variability of the data at hand. Do not ask for this project if you’re not fond of Rubik’s cubes 😉

An example reference: “Archivio di Stato di Venezia, Santo Ufficio, b. 13, f. 2”, which reads as State Archive of Venice, Holy Office (the Inquisition), box 13, file 2. Your task is to link this reference to the identifier of the Holy Office, see: http://213.136.75.178/siasve/cgi-bin/pagina.pl?Tipo=fondo&Chiave=8485
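
As a very first baseline for this linking task, one could fuzzily match the parsed fonds name against the list of fonds names exported from the archival information system. The sketch below uses only the standard library; the fonds catalogue and identifiers are invented placeholders, and a real solution would have to handle abbreviations and multilingual variants far more carefully.

```python
# Hedged sketch: naive candidate retrieval for archival fonds linking (placeholders only).
import difflib

# Hypothetical excerpt of a fonds catalogue: name -> identifier in the archive's system.
fonds_catalogue = {
    "Santo Ufficio": "8485",
    "Senato, Deliberazioni, Mar": "12345",
    "Consiglio dei Dieci": "67890",
}

def link_fonds(parsed_fonds_name, catalogue, cutoff=0.6):
    """Return the identifier of the closest catalogue entry, or None."""
    match = difflib.get_close_matches(parsed_fonds_name, list(catalogue), n=1, cutoff=cutoff)
    return catalogue[match[0]] if match else None

# "S. Uffizio" is a plausible abbreviated/variant spelling found in footnotes.
print(link_fonds("S. Uffizio", fonds_catalogue))   # ideally -> "8485"
```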

Profile: Computer Science, Data Science or similar

Required skills: NLP, Machine Learning

Dynamic History of the Histories of Venice (Semester project – Spring 2018)

Type of project: semester, master

Supervision: Giovanni Colavizza

Description: The literature on Venice has been accumulating for centuries. The DHLab has digitised and OCRed thousands of books and journal issues on the topic. Your task will be to reconstruct the development of this field of study from a macroscopic perspective, from the 19th to the 21st century, using the full text and the network of citations that are already available for your use. You might want to use dynamic topic models and/or network clustering, but the project is open-ended and your results will be discussed with domain experts so as to put them in context.
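
As a starting point for the dynamic topic modelling direction, the sketch below trains a separate LDA model per time bucket with gensim, which is a crude approximation of a proper dynamic topic model. The tokenised documents and bucket boundaries are placeholders; gensim also offers LdaSeqModel for genuinely sequential topic models.

```python
# Hedged sketch: per-decade LDA as a rough proxy for a dynamic topic model.
# The documents and time buckets below are placeholders for the Venice corpus.
from gensim import corpora, models

docs_by_decade = {
    1890: [["doge", "senate", "archive"], ["trade", "galley", "levant"]],
    1950: [["art", "renaissance", "painting"], ["church", "parish", "confraternity"]],
}

for decade, docs in sorted(docs_by_decade.items()):
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
    print(decade, lda.print_topics(num_words=3))
```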

Profile: Computer Science, Data Science or similar

Required skills: NLP, Machine Learning, Networks

Bibliography (ask me for a copy if in need):

•    Anderson, Ashton, Dan McFarland, and Dan Jurafsky. 2012. “Towards a Computational History of the ACL: 1980–2008.” In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, 13–21. Association for Computational Linguistics.
•    Chavalarias, David, and Jean-Philippe Cointet. 2013. “Phylomemetic Patterns in Science Evolution—the Rise and Fall of Scientific Fields.” PloS One 8 (2): e54847.
•    Hall, David, Daniel Jurafsky, and Christopher D. Manning. 2008. “Studying the History of Ideas Using Topic Models.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 363–371. Association for Computational Linguistics.
•    Rule, Alix, Jean-Philippe Cointet, and Peter S. Bearman. 2015. “Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014.” Proceedings of the National Academy of Sciences 112 (35): 10837–44.

Using Context Awareness to Improve Domain­ Specific Named Entity Disambiguation (Master project – Spring 2017)

Supervision: Matteo Romanello, Maud Ehrmann

Overview:

This project deals with the challenges that lie in the adaptation of Named Entity Recognition (NER) to domain-specific phenomena, and specifically capturing the surrounding context of a Named Entity in order to improve its disambiguation.

The starting point for this project is an existing NER system, developed to extract bibliographic references to classical (i.e. ancient Greek and Latin) texts. Such a system performs:

1) Named Entity Recognition and Classification;
2) Relation Extraction and
3) Named Entity and Relation Disambiguation.

Such references are modelled as relations existing between a limited number of citation components, which constitute the Named Entities. Also, the system relies on an underlying semantic knowledge base to acquire domain-specific information (e.g. names and abbreviations of ancient authors in several languages).

The main problem for the disambiguation of these references is the implicit topicalisation that often characterises natural language. For example, an abstract of a publication focussing on Homer’s Iliad will very likely omit the indication of the author and work when referring to specific lines and books of the poem. This phenomenon naturally raises some interesting challenges when performing relation and entity disambiguation. The key to solving this problem is to extract a set of features that can effectively capture and represent the context of a given bibliographic reference.

During this project the existing NER system–and specifically its rule-based named entity disambiguation module–will constitute the baseline. The student will set up and run experiments with different feature sets and machine learning algorithms in order to test whether any combination of the two can outperform the baseline. The identified solution will be implemented as a working prototype and evaluated. The data to be used during this project is a corpus of bibliographic abstracts from Classics, consisting of 7,300 abstracts for a total of 380,000 tokens. A subset of these abstracts containing approximately 25,000 tokens was manually validated and can be used as training or test set.

Skills:

Natural Language Processing, Machine learning (no understanding of Ancient Greek and Latin required)

Named Entity Disambiguation in Le Temps Archives (Semester project – Spring 2017)

Supervised by: Maud Ehrmann, Frédéric Kaplan

Description:

The Le Temps corpus brings together about 4M news articles published by La Gazette de Lausanne and Le Journal de Genève over 200 years (1798–1998). Printed issues have been digitally acquired through digitization and optical character recognition (OCR), which opened up many research opportunities, e.g. at the linguistic and historical levels.

Named entity recognition (NER) has been applied to the whole archive, taking into account entities of type Person and Location. Many entity mentions have been extracted (ca. 30M persons and 50M locations) using a rule-based algorithm. The performance of this extraction has been evaluated diachronically; evaluation details are reported in this paper, but overall the results are as follows: 67.6% precision and 18.3% recall for Person, and 84.6% precision and 46.6% recall for Location. Naturally, performance fluctuates over the years.

The natural next step is to implement named entity disambiguation (NED), which is the objective of the present semester project. Named entity disambiguation, also known as entity linking (EL) depending on the exact task definition, is the task of assigning a unique identifier to an entity mentioned (generally several times) in texts. In other words, the objective is to align textual mentions of entities with a unique identifier, usually taken from a knowledge base (e.g. Wikipedia, DBpedia, Freebase, Wikidata).
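
As a very simple starting point for candidate generation, one can query the public Wikidata search API for each mention and keep the top candidates for later disambiguation. The sketch below uses the standard wbsearchentities endpoint; the example mention and the way candidates would then be ranked against the article context are left open.

```python
# Hedged sketch: retrieve Wikidata candidates for an entity mention (candidate generation only).
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def candidate_entities(mention, language="fr", limit=5):
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    resp = requests.get(WIKIDATA_API, params=params, timeout=10)
    resp.raise_for_status()
    return [
        {"id": hit["id"], "label": hit.get("label"), "description": hit.get("description")}
        for hit in resp.json().get("search", [])
    ]

# Example: an ambiguous person mention from the archive.
for cand in candidate_entities("Dupont"):
    print(cand["id"], cand["label"], "-", cand["description"])
```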

Entity Relation Extraction in Le Temps digital archives (Semester project – Spring 2017)

Supervised by: Maud Ehrmann, Frédéric Kaplan

Description:

The Le Temps corpus brings together about 4M news articles published by La Gazette de Lausanne and Le Journal de Genève over 200 years (1798–1998). Printed issues have been digitally acquired through digitization and optical character recognition (OCR), which opened up many research opportunities, e.g. at the linguistic and historical levels.

Named entity recognition (NER) has been applied to the whole archive, taking into account entities of type Person and Location. Many entity mentions have been extracted (ca. 30M persons and 50M locations) using a rule-based algorithm. The performance of this extraction has been evaluated diachronically; evaluation details are reported in [1], but overall the results are as follows: 67.6% precision and 18.3% recall for Person, and 84.6% precision and 46.6% recall for Location. Naturally, performance fluctuates over the years.

Named entities are referential units which are key for subsequent corpus analysis by humanists. In order to further support this analysis, it is possible to try to automatically detect relations between entities; this is the objective of the current project.

Various methods exist to do so, summarized in [1]. During this project, we will try to implement a distant supervision approach, such as the one presented in [2]; a toy illustration follows the references below.

[1] http://cs.nyu.edu/grishman/survey.html

[2] http://dl.acm.org/citation.cfm?id=1690287
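
To make the distant supervision idea concrete, the sketch below aligns knowledge-base relation pairs with sentences that mention both entities, turning them into (noisily) labelled training examples. The tiny knowledge base and sentences are invented placeholders; a real implementation would work on the NER output of the whole archive and feed the examples to a relation classifier.

```python
# Hedged sketch: building distantly supervised training examples for relation extraction.
# The "knowledge base" and sentences are placeholders, not real project data.
KB = {
    ("Henri Dunant", "Genève"): "born_in",
    ("Gustave Ador", "Croix-Rouge"): "president_of",
}

sentences = [
    "Henri Dunant est né à Genève en 1828.",
    "Gustave Ador a présidé la Croix-Rouge pendant la guerre.",
    "Genève accueille une conférence internationale.",
]

def distant_examples(sentences, kb):
    """Label a sentence with a KB relation if it contains both entities of a KB pair."""
    examples = []
    for sent in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in sent and e2 in sent:
                examples.append({"sentence": sent, "e1": e1, "e2": e2, "label": relation})
    return examples

for ex in distant_examples(sentences, KB):
    print(ex["label"], "<-", ex["sentence"])
```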

OCR correction of Le Temps (Master project – Autumn 2016)

Supervised by: Maud Ehrmann, Frédéric Kaplan

The Le Temps corpus brings together about 4M news articles published by La Gazette de Lausanne and Le Journal de Genève over 200 years (1798–1998). Printed issues have been digitally acquired through digitization and optical character recognition (OCR), which opened up many research opportunities, e.g. at the linguistic and historical levels. However, OCR algorithms are not 100% accurate (especially with old fonts and sometimes degraded material) and the corpus includes transcription mistakes.

Based on an already existing error detection module, the objective of this master project is to devise, implement and evaluate methods to correct OCR mistakes in the Le Temps corpus. Please find an example in the linked files below. More precisely, the project will explore the following:

(1) Error detection

Based on the existing error detection module and its evaluation, the objective is to improve error detection by:

(1) fine-tuning the word frequency threshold

(2) fine-tuning the integration of lexicon-based and frequency-based error detection

(3) exploring how to integrate evidence from word2vec

(2) Error correction

Based on the findings of the error analysis, the objective is to implement one or several techniques for word error correction, such as: minimum edit distance, character n-gram correction patterns (to correct the most frequent mistakes), topic modelling.
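
As an illustration of the detection/correction loop, the sketch below flags tokens that are absent from a lexicon and rare in the corpus, then proposes corrections by similarity against the lexicon. The lexicon, frequency table and thresholds are placeholders, and difflib is used as a stand-in for a proper minimum edit distance implementation.

```python
# Hedged sketch: frequency/lexicon-based OCR error detection + similarity-based correction.
# Lexicon, corpus counts and thresholds are placeholders for the real Le Temps resources.
import difflib
from collections import Counter

lexicon = {"gouvernement", "conseil", "fédéral", "séance", "hier"}
corpus_counts = Counter({"gouvernement": 950, "conseil": 800, "gouvemement": 3, "hier": 400})

FREQ_THRESHOLD = 5  # tokens rarer than this and outside the lexicon are suspects

def is_suspect(token):
    return token not in lexicon and corpus_counts[token] < FREQ_THRESHOLD

def correct(token, cutoff=0.8):
    """Propose the closest lexicon entry, if any is similar enough."""
    match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token

for tok in ["gouvemement", "conseil", "hier"]:
    if is_suspect(tok):
        print(f"{tok} -> {correct(tok)}")
    else:
        print(f"{tok} (kept)")
```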

Development will include the evaluation of both steps. Optionally, the methods could be tested on other corpora (in French or another language).

scanned text

errors highlighted

Skills

Reading and understanding French, Java or Python, optionally Scala (possibility to use clusters).

Bibliography

K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377–439, 1992.

Please find in the supplementary files a longer bibliography.

OCR correction of Le Temps (Semester project – Spring 2016)

Supervised by: Maud Ehrmann, Vincent Buntinx and Yannick Rochat

The Le Temps corpus brings together about 4M news articles published by La Gazette de Lausanne and Le Journal de Genève over 200 years (1798–1998). Printed issues have been digitally acquired through digitization and optical character recognition (OCR), which opened up many research opportunities, e.g. at the linguistic and historical levels. However, OCR algorithms are not 100% accurate (especially with old fonts and sometimes degraded material) and the corpus includes transcription mistakes.

The objective of this project is to devise, implement and evaluate methods to correct OCR mistakes in the Le Temps corpus. Please find an example in the linked files below. More precisely, the project would be developed as follows:

(1) Error detection

Implementing and testing method(s) for error detection, based on frequencies, n-grams and/or lexicon look-up.

(2) Error analysis

Describing the error typology (to guide error correction) and characterizing the error distribution over the entire corpus (to spot highly corrupted text zones).

(3) Error correction

Based on the findings of the error analysis, the objective will be to try one or several techniques for word error correction, such as: minimum edit distance, character n-grams, topic modelling.

(4) Evaluation 

This step will include the manual annotation of a few samples and the evaluation of the algorithm implemented in (3), using e.g. https://github.com/impactcentre/ocrevalUAtion
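
Alongside a dedicated tool such as ocrevalUAtion, a quick character error rate (CER) check against the manually annotated samples can also be computed directly, as in the sketch below. The ground-truth/corrected string pair is an invented example, and the edit distance is a minimal dynamic-programming implementation.

```python
# Hedged sketch: character error rate of a corrected line against a manually annotated ground truth.
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth, hypothesis):
    return edit_distance(ground_truth, hypothesis) / max(len(ground_truth), 1)

# Invented example pair (manual annotation vs. corrected OCR output).
gt = "le gouvernement fédéral a siégé hier"
hyp = "le gouvemement fédéral a siégé hier"
print(f"CER = {cer(gt, hyp):.3f}")
```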

Skills

Reading and understanding French, Java or Python, optionally Scala (possibility to use clusters).

Bibliography

K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377–439, 1992.

Please find in the supplementary files a longer bibliography.

The “Shazam” of paintings (Semester project – Spring 2016)

Supervision: Frédéric Kaplan, Benoit Seguin, Isabella di Lenardo

The goal of this project is to make a smartphone app that can identify paintings or details in paintings. The smartphone app will use the deep learning artwork recognition server, currently under development at the DHLAB, to perform the matching. A painting or a detail in the database that is as similar as possible to the query should be returned. If the painting is unknown, the user will be able to register it, process its main characteristic features and thus extend the database of paintings that the system is able to recognize. During the project a first prototype should be built, deployed and evaluated.

Skills :

iOS or Android development, Curiosity for Art History

Profile:

Computer Science or Communication Systems

From archival documents to RDF graphs and back: a web interface to explore historical data and its sources (Semester project – Spring 2016)

Supervised by: Maud Ehrmann, Orlin Topalov

Overview
This project builds on an already developed interface and aims at extending it by adding new key features and functionalities.
The available prototype, namely DHExplorer, allows users to explore historical data represented as Linked Data. The exploration dashboard currently features three entry points: an entity view, a network view and a data analytics view.
The present project will go a step further by:
– establishing, within the interface, the link between data and sources (i.e. images);
– complementing the exploration features, especially by rendering people’s life events on a timeline.
An important aspect of the interface is that it should be as generic as possible, so as to deploy it seamlessly on different RDF graphs (of the same domain).
Skills
web development (particularly JavaScript), interest in historical data; knowledge of Semantic Web tools and languages is a plus.

3D Reconstruction of Venice based on films (spring 2017)

Type of project: bachelor, semester, master

Supervised by: Nils Hamel, Frédéric Kaplan

Description:
The DHLAB is currently working on a new pipeline for extracting point clouds from movies. This method is extremely efficient for producing 3D models. This project consists in building a temporal point cloud model of the city of Venice by extracting data from a large number of films showing different parts of the city at different periods.

Ancient Photogrammetry (Autumn 2017)

Type of project: bachelor, semester, master

Supervised by: Nils Hamel, Sofia Oliveira, Frédéric Kaplan.

Description:
This project is part of the eratosthene 4D world indexation server project. As the server is able to hold an earth-scale 4D model, efforts are required to produce 3D models at different times, which are injected into the server to feed the 4D information system.

In this project, old aerial images are processed with modern photogrammetry techniques to build 3D models of cities during the 20th century. Because of the nature of the images, several image processing steps are required before sending them to a photogrammetric pipeline. These steps include artefact correction, image transformation and position-weighted exposure correction.

The goal of the project is to set up an automated preprocessing pipeline able to prepare old aerial images before sending them to a photogrammetric pipeline.

Ancient Photogrammetry

Required skills:
Image processing, Mathematics (analytic geometry), Photogrammetry

Image credits:
Nils Hamel (DHLAB), IGN

Smart Cloud (Autumn 2017)

Type of project: bachelor, semester, master

Supervised by: Nils Hamel, Frédéric Kaplan

Description:
In many domains, such as GIS, point clouds are the fundamental data on which
information systems are based. The great advantages of point clouds are their
simplicity, the large amount of information they hold and the ability to certify
each of their elements in both relative and absolute frames.

Despite these significant advantages, point clouds are far from being smart data in the way meshes and geographic databases can be. The goal of this project is to define algorithms able to analyse point cloud elements in order to segment their content by performing concept-driven grouping of the elements.

Depending on the available time, several different algorithms can be developed in this project, from simple algorithms supporting surface detection for geodetic alignment to more advanced concept-driven automatic segmentation of the point clouds.

The goal of this project is to show that point clouds, with the help of algorithms, can be turned into smart data, making it possible to solve different problems in the interpretation and classification of their content.
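
As one concrete building block for such algorithms, the sketch below detects a dominant planar surface in a point cloud with a minimal RANSAC loop in NumPy; this is the kind of primitive that surface detection for geodetic alignment could start from. The synthetic point cloud, distance threshold and iteration count are illustrative assumptions.

```python
# Hedged sketch: RANSAC plane detection in a point cloud (synthetic data, NumPy only).
import numpy as np

rng = np.random.default_rng(1)

# Synthetic cloud: 1000 points near the plane z = 0.2x + 0.1y (+ noise), plus 300 outliers.
xy = rng.uniform(-10, 10, size=(1000, 2))
plane_pts = np.column_stack([xy, 0.2 * xy[:, 0] + 0.1 * xy[:, 1] + rng.normal(0, 0.05, 1000)])
outliers = rng.uniform(-10, 10, size=(300, 3))
cloud = np.vstack([plane_pts, outliers])

def ransac_plane(points, n_iter=200, threshold=0.1):
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        normal /= norm
        distances = np.abs((points - sample[0]) @ normal)
        inliers = distances < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

inliers = ransac_plane(cloud)
print(f"{inliers.sum()} of {len(cloud)} points assigned to the dominant plane")
```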

Smart Cloud

Required skills:
Mathematics (analytic geometry), 2D/3D Rendering

Image credits:
Nils Hamel (DHLAB), SITG

Voxel 4D-Earth (Spring & Autumn 2018)

Type of project :
bachelor/master

Supervision :
Nils Hamel, Frédéric Kaplan

Description :
This project is linked to the active development of the eratosthene project, which implements a server able to index and store a point-based 4D model of the entire earth.

In the context of the eratosthene project, a graphical client has been developed that allows browsing the 4D earth model in both space and time. Currently, the rendering offered by the client is point-based, leading to a sparse visualisation.

The goal of this project is to implement a voxel-based graphical client in order to allow a continuous visualisation of the 4D model. This includes the implementation of the graphical aspects and the development of a procedure able to transform the server index into a voxel grid.

The project aims to answer the question of how much the visualisation of the 4D earth model can be improved using continuous voxel grids.
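
To illustrate the kind of transformation involved, the sketch below converts a small point cloud into a sparse set of occupied voxels with NumPy by quantising point coordinates to a fixed cell size. The server index format is not modelled here; the input points and voxel size are placeholders.

```python
# Hedged sketch: quantise a point cloud into occupied voxels (NumPy only).
import numpy as np

rng = np.random.default_rng(2)
points = rng.uniform(0, 50, size=(10_000, 3))   # placeholder point cloud (metres)
voxel_size = 1.0                                # placeholder cell size

# Integer voxel coordinates for every point, then the set of occupied cells.
voxel_coords = np.floor(points / voxel_size).astype(np.int64)
occupied, counts = np.unique(voxel_coords, axis=0, return_counts=True)

print(f"{len(points)} points -> {len(occupied)} occupied voxels")
print("densest voxel:", occupied[counts.argmax()], "with", counts.max(), "points")
```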

Project illustration

Preferred skills :
Mathematics, Programming in C/C++, Graphical programming (OpenGL)

Image credits :
Nils Hamel (DHLAB)

Earth Deep Index (Spring & Autumn 2018)

Type of project :
master

Supervision :
Nils Hamel, Frédéric Kaplan

Description :
This project is related to the eratosthene 4D earth server. The goal of the project is to determine how 3D and 4D information about earth structures can be analysed and classified using deep learning methods.

In a first phase, the goal is to analyse how earth 3D cells can be organised in the latent space of auto-encoders. This implies the development of an auto-encoder able to handle 3D rasters. The analysis of the latent space is meant to give a first insight into the classification that deep neural networks can perform on such data.

In a second phase, the project will analyse the possibility of establishing a relation between the 3D rasters and the indices used to store and query the data they contain. The goal is to determine whether such a relationship exists and to what extent it can be applied across wide ranges and scales.
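
As a minimal starting point for the first phase, the sketch below defines a small 3D convolutional auto-encoder in Keras whose bottleneck provides the latent space to analyse. The raster resolution (32³ single-channel cells), layer sizes, latent dimension and random training data are illustrative assumptions, not project specifications.

```python
# Hedged sketch: a tiny 3D convolutional auto-encoder for 32x32x32 occupancy rasters.
import numpy as np
from tensorflow.keras import layers, models

latent_dim = 64  # assumed size of the latent space to inspect

encoder = models.Sequential([
    layers.Input(shape=(32, 32, 32, 1)),
    layers.Conv3D(16, 3, strides=2, padding="same", activation="relu"),   # -> 16^3
    layers.Conv3D(32, 3, strides=2, padding="same", activation="relu"),   # -> 8^3
    layers.Flatten(),
    layers.Dense(latent_dim),
], name="encoder")

decoder = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(8 * 8 * 8 * 32, activation="relu"),
    layers.Reshape((8, 8, 8, 32)),
    layers.Conv3DTranspose(16, 3, strides=2, padding="same", activation="relu"),
    layers.Conv3DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),
], name="decoder")

autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Placeholder training data: random occupancy grids standing in for earth 3D cells.
x = np.random.rand(16, 32, 32, 32, 1).astype("float32")
autoencoder.fit(x, x, epochs=1, batch_size=4, verbose=0)

latent = encoder.predict(x, verbose=0)   # vectors to organise / cluster / visualise
print("latent space shape:", latent.shape)
```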

Project illustration

Preferred skills :
Mathematics, Programming, Geodesy (geographic frames), 3D/Voxel models, Deep learning (Tensorflow)

Image credits :
Nils Hamel (DHLAB)

Unlimited Train Tracks (Spring & Autumn 2018)

Type of project :
bachelor/master

Supervision :
Nils Hamel, Frédéric Kaplan

Description :
The goal of this project is to take advantage of a specific type of video widely available on the web: "driver's view" videos. These videos are interesting from the 3D model computation point of view, as they make it possible to build image datasets that can be used in photogrammetric pipelines.

The project consists in determining to what extent such media can be used for the computation of large-scale 3D models. This includes image dataset extraction from the videos and the management of the 3D model computation. The project has to show how far it is possible to go toward a continuous digitisation of cities/environments through such a process.
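
The first practical step, extracting an image dataset from such a video, could look like the sketch below, which samples frames at a fixed interval with OpenCV. The input path, sampling rate and output layout are placeholders; a real pipeline would also filter blurred frames and check the overlap between consecutive views.

```python
# Hedged sketch: sample frames from a "driver's view" video for a photogrammetric dataset.
import os
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    """Save every n-th frame of the video as a JPEG (placeholder paths and rate)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example usage (placeholder file):
# n = extract_frames("driver_view_venice.mp4", "frames/", every_n=30)
# print(n, "frames extracted")
```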

Project illustration

Preferred skills :
Mathematics, Programming, Geodesy (geographic frames), Image/Video processing, Photogrammetry, 3D models

Image credits :
Nils Hamel (DHLAB)

Deep in YouTube (Spring & Autumn 2018)

Type of project :
master

Supervision :
Nils Hamel, Frédéric Kaplan

Description :
The goal of this project is to determine to what extent it is possible to train and validate a neural network able to determine in advance whether a given video can be used in photogrammetric pipelines to produce 3D models. This would allow an automatic video-gathering process to set up massive 3D model computations.

In this project, the neural network has to be designed, trained and validated entirely. This includes the determination of the best topology, the format of the input data and the nature of the network output.

In addition, training datasets are not available for this very specific application. The project therefore also includes the construction of datasets suitable for training such networks.

Preferred skills :
Mathematics, Programming, Image/Video processing, Photogrammetry, Deep learning (Tensorflow)