The Automatic Collation for Diversifying Corpora (ACDC) project will significantly improve the accuracy of handwritten text recognition (HTR) for Arabic-script manuscripts. It is developing a collation tool that automatically creates large amounts of training data from existing digital texts and manuscript images, without time-consuming human annotation of individual manuscripts. To do this, the project will extend the text alignment tool passim and the HTR engine Kraken so that very poor initial HTR transcriptions of diverse manuscript exemplars can be aligned with existing digital texts, producing training data in a “distantly supervised” manner. By accelerating the production of training data, the ACDC tool will enable, for the first time, the creation of the generalizable Arabic and Persian HTR models required for the automatic digital transcription of large-scale Persian and Arabic manuscript collections.
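To make the idea of aligning a poor transcription with an existing digital text more concrete, here is a minimal sketch in Python. It is not the ACDC pipeline and does not use passim or Kraken: it substitutes the standard library's `difflib.SequenceMatcher` for a real alignment tool, and the function name `align_line`, the `min_ratio` threshold, and the English placeholder text are all assumptions made for illustration. Given one noisy recognized line, it looks up the span of a clean reference text that the line most plausibly corresponds to, which could then serve as a training label for that line image.

```python
from difflib import SequenceMatcher


def align_line(noisy_line, reference, min_ratio=0.5):
    """Return (reference_span, similarity) for a noisy line, or None.

    Hypothetical helper for illustration only: it uses character-level
    matching from difflib rather than a dedicated alignment tool.
    """
    matcher = SequenceMatcher(None, noisy_line, reference, autojunk=False)
    # Keep only real matches; difflib appends a zero-length sentinel block.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None

    # Take the reference span from the first to the last matched character.
    start = blocks[0].b
    end = blocks[-1].b + blocks[-1].size
    candidate = reference[start:end]

    # Reject alignments where the noisy line and the span differ too much.
    ratio = SequenceMatcher(None, noisy_line, candidate, autojunk=False).ratio()
    if ratio < min_ratio:
        return None
    return candidate, ratio


if __name__ == "__main__":
    # Placeholder texts (assumptions): a clean digital text and a noisy line.
    reference = (
        "in the name of god the merciful the compassionate "
        "praise be to god lord of the worlds"
    )
    noisy = "the mercifvl the compasionate"
    print(align_line(noisy, reference))
```

In this sketch the noisy line keeps its recognition errors, while the returned reference span supplies the corrected text. A real system would also have to handle long texts, reordered or missing passages, and right-to-left scripts, which this simplified example does not attempt.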
In recognition of its far-reaching potential in the field of Islamicate humanities, the ACDC project was recently awarded a generous grant from the National Endowment for the Humanities. The grant supports both the technical development of the project and its pedagogic outcomes, including the upcoming Digital Paleography and Codicology Summer School.
The project is still in its initial phases and will soon have its own website. For a more in-depth overview of its development and planned direction, see PI Dr. Matthew Miller’s Twitter thread introducing the project.