Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource
Abstract
Publication Date:
2019
abstract:
The Petit Larousse Illustré (PLI) is a monolingual French dictionary which has been
published every year since the 1906 edition, and which is therefore a fundamental
record of the evolution of the French language. Because the pre-1948
editions of the PLI entered the public domain in 2018, the Nénufar (Nouvelle édition
numérique de fac-similés de référence) project was launched at the Praxiling laboratory
in Montpellier with the aim of digitizing these editions and making them available
electronically. The project is still ongoing: selected editions from each decade
will be fully digitized (so far the 1906, 1924 and 1925 editions have been
completed), and changes between them will be traced back and dated to a specific year.
Nénufar's primary aim is to make the editions available and searchable via an advanced
search interface which will not only enable selective querying of the text by lemma and
type of content (definitions, examples, ...), but crucially also make it possible to detect
and study changes by comparing different editions. To this end, a dedicated web interface has been put
in place. Alongside the digitized text, the Nénufar website contains high-quality
scans for each page. In compliance with current
open data best practices (Wilkinson et al., 2016), the project also aims to make the
source data available separately from the querying interface both for research and for
long-term preservation.

A similar project which presents data and scans from subsequent editions of the same
legacy dictionary has been carried out by the team behind the Swedish Academy's Wordlist
(see Holmer, Malmgren, and Martens (2016) and http://spraakdata.gu.se/saolhist/).

eLex 2019: Book of Abstracts 36

The primary encoding format is TEI-XML; however, in our case
the TEI encoding is closely inspired by the latest version of the TEI-Lex0 (Bański et
al., 2017, Romary & Tasovac, 2018) guidelines for encoding lexicographic resources,
which are based upon TEI. The choice of a TEI-based approach allows the Nénufar
project to align itself with other pre-existing initiatives and tools. By aligning ourselves
with TEI-Lex0, we will be able to make use of digitization tools such as Grobid
(Khemakhem et al., 2017), which have TEI-Lex0 as their native format and which have
already been tested and used within the Nénufar project to speed up the digitization
of new editions. In addition, we will be able to make use of ongoing initiatives to convert
TEI-Lex0 datasets to RDF using OntoLex-Lemon (McCrae et al., 2017; Bosque-Gil et al.,
2016), the W3C Community Group model for publishing lexicons as Linked Data,
which will allow for the publication of the Nénufar dataset as an LOD graph. The LOD
version of the Nénufar dataset, currently under development, will be queryable through
a SPARQL endpoint and will contain all available editions as a single graph,
allowing expert users to perform complex queries that could detect systematic
changes in the dataset. The LOD version is particularly well suited to being linked to other
datasets; more recent editions, once added, could also be of interest for NLP
applications.
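
The edition comparison described above can be illustrated with a minimal sketch. It assumes each edition is available as a TEI-Lex0-style file in which every `entry` has a `form type="lemma"` with an `orth` child and one or more `sense`/`def` elements. The sample entries, the lemmas, and the function names `parse_edition` and `compare_editions` are invented for illustration; they are not taken from the PLI or from the Nénufar code base, whose actual encoding may differ.

```python
import xml.etree.ElementTree as ET

# TEI namespace, needed to address elements in ElementTree path queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample data for illustration only; not taken from any PLI edition.
EDITION_OLD = """
<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <entry xml:id="old.alpha" xml:lang="fr">
    <form type="lemma"><orth>alpha</orth></form>
    <sense xml:id="old.alpha.1"><def>First invented definition.</def></sense>
  </entry>
  <entry xml:id="old.beta" xml:lang="fr">
    <form type="lemma"><orth>beta</orth></form>
    <sense xml:id="old.beta.1"><def>Definition dropped later.</def></sense>
  </entry>
</body></text></TEI>
"""

EDITION_NEW = """
<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <entry xml:id="new.alpha" xml:lang="fr">
    <form type="lemma"><orth>alpha</orth></form>
    <sense xml:id="new.alpha.1"><def>Revised invented definition.</def></sense>
  </entry>
  <entry xml:id="new.gamma" xml:lang="fr">
    <form type="lemma"><orth>gamma</orth></form>
    <sense xml:id="new.gamma.1"><def>Newly added definition.</def></sense>
  </entry>
</body></text></TEI>
"""


def parse_edition(xml_text):
    """Map each lemma to the list of its definition texts."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        orth = entry.find("tei:form[@type='lemma']/tei:orth", TEI_NS)
        if orth is None or not orth.text:
            continue  # skip entries without a usable lemma
        definitions = [
            # Flatten nested markup and normalise whitespace.
            " ".join("".join(d.itertext()).split())
            for d in entry.iterfind(".//tei:sense/tei:def", TEI_NS)
        ]
        entries[orth.text.strip()] = definitions
    return entries


def compare_editions(old, new):
    """Report lemmas added, removed, or redefined between two editions."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(lemma for lemma in shared if old[lemma] != new[lemma]),
    }


diff = compare_editions(parse_edition(EDITION_OLD), parse_edition(EDITION_NEW))
print(diff)
```

Keying the comparison on the lemma string keeps the sketch short, but it would conflate homographs; a real pipeline would more plausibly align entries through stable identifiers shared across editions.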
Iris type:
04.02 Abstract in Atti di convegno
Keywords:
e-lexicography; dictionaries; le petit larousse
List of contributors: