AI Model for automatic transcription of handwritten documents
The model enables large-scale transcription of historical documents.
Service/Expertise Overview
Handwritten Text Recognition (HTR) is a recent and important technology used mainly by archives, libraries and researchers.
TraPrInq Portuguese Handwriting 16th-19th c. is the first generic model released in open access. It is available on Transkribus, the digital palaeography platform managed by ReadCoop, a European cooperative (readcoop.eu). The model was created during the FCT-funded exploratory project “Transcription of the Court Trials of the Portuguese Inquisition (1536-1821)” (ref.: EXPL/HAR-HIS/0499/2021).
Training data: palaeographic transcriptions produced by a team of 10 palaeographers, totalling 1.3 million words (Training Set and Validation Set combined).
The model transcribes automatically with a Character Error Rate (CER) of 5.2%.
Fig. 1: Screenshot from the Transkribus interface, giving an overview of the results of the 9th training of the model.
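For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate is commonly computed: the edit distance between a reference transcription and the automatic output, divided by the length of the reference. It is an illustration under stated assumptions, not the Transkribus implementation; the function names and the sample line are hypothetical, and any text normalisation the platform may apply before scoring is not reproduced here.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference` (classic dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical sample line (not taken from the project's data):
# one wrong character in a 21-character reference.
reference_line = "aos vinte dias do mes"
automatic_line = "aos vinte dias de mes"
print(f"CER = {cer(reference_line, automatic_line):.2%}")
```

Read this way, a CER of 5.2% means that, on average, roughly 5 characters in every 100 of the automatic transcription differ from the palaeographers' reference and would need correction.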
Competitive advantages
- Faster and more cost-efficient transcription.
- Usable at both institutional and individual levels (e.g. Master's and Ph.D. projects).
- Suitable for any project aimed at an online edition.
Applications
- Historical documents (16th-19th centuries).
- Use of the model as a base model for further training.
- Extension of the model to other periods.
- Future connection with LLM technology.
Further Details
A first public model for 17th-century printed matter was released in 2020: https://readcoop.eu/model/latin-portuguese-print-17th-century/
Details of the work: since 2022, the project's research blog has reported on every training of the model: https://traprinq.hypotheses.org/
Project website: https://traprinq.mozellosite.com/home/
Research unit
CHAM – Centro de Humanidades