Bern, baby, Bern!

From 12 to 16 May 2025, I joined four other researchers and three members of the RISM Digital Center team at the digital humanities centre at the University of Bern in Switzerland to undertake a Short Term Scientific Mission (STSM) involving RISM-Online as part of the EU project EarlyMuse.

RISM stands for Répertoire International des Sources Musicales, an international organisation founded in 1952 dedicated to comprehensively documenting all historical music sources extant across the world. RISM-Online is a database that grew out of the online RISM catalogue. The database allows a wide variety of search functions, including search by incipit (i.e., searching by the first few bars of notated music in the source). 

The residency focused on two main problems: (1) extracting information from the hard-copy RISM series catalogues, with the metadata suitably structured for import into RISM-Online, and (2) finding a way to enable a broader range of contributors to enter data into Muscat, the back-end of the RISM-Online catalogue. Regarding the first challenge, RISM has published over 60 hard-copy catalogues listing historical music sources and holding information. Each catalogue is many hundreds of pages long, and, at present, the information from these catalogues must be entered manually into RISM-Online. Our task was to find a way to automate at least some of this process.

First, we investigated whether a machine learning tool (colloquially known as AI) could encode information directly from a PDF of one of the series. This proved problematic, so we tried various Optical Character Recognition (OCR) tools to see which was the most reliable at extracting the PDF text into plaintext. OCR is challenging for the RISM series, as the volumes include a wide mix of languages, unusual abbreviations within the metadata, and an unusual formatting structure, all of which cause problems for most OCR tools. The best solution we found was to isolate the information relating to a single entry (one source) on a page as a ‘card’, and to train the most reliable AI tool we had found thus far (FormX.ai) to extract the relevant information from each card and restructure it into MARCXML. Some fields still needed extra attention. Composer names, for example, were challenging for FormX.ai, as the names appear separately above the main body text in the document. Moreover, the name of the composer or author is not repeated for each of their works; a hyphen is used in its place. We had to train the model to add the omitted names in such cases. This process provided accurate, correctly structured metadata; however, FormX.ai can only process one card at a time, and we hope that further testing will provide a more efficient solution.
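To make the composer-name problem concrete, here is a minimal Python sketch of the carry-forward logic and of wrapping one card in MARCXML. It is not the FormX.ai workflow we used during the residency; it only illustrates the idea under stated assumptions. The card dictionaries, the function names and the sample data are hypothetical, and the choice of MARC fields (100 $a for the name, 245 $a for the title) is a simplified illustration rather than the field set Muscat expects.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace published by the Library of Congress.
MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)

# Assumption: the catalogue uses a hyphen or dash in place of a repeated name.
PLACEHOLDERS = {"-", "\u2013", "\u2014"}


def fill_composers(cards):
    """Replace placeholder names with the most recent explicit composer name."""
    filled = []
    last_name = None
    for card in cards:
        name = card["composer"].strip()
        if name in PLACEHOLDERS:
            if last_name is None:
                raise ValueError("placeholder found before any composer name")
            name = last_name
        else:
            last_name = name
        filled.append({**card, "composer": name})
    return filled


def card_to_marcxml(card):
    """Serialise one card as a very small MARCXML record (illustrative fields only)."""
    record = ET.Element(f"{{{MARC_NS}}}record")
    fields = (
        ("100", "1", " ", card["composer"]),  # main entry, personal name
        ("245", "1", "0", card["title"]),     # title statement
    )
    for tag, ind1, ind2, value in fields:
        datafield = ET.SubElement(
            record,
            f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": ind1, "ind2": ind2},
        )
        subfield = ET.SubElement(datafield, f"{{{MARC_NS}}}subfield", {"code": "a"})
        subfield.text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical sample cards, not taken from a RISM volume.
SAMPLE_CARDS = [
    {"composer": "Example, Composer", "title": "First work"},
    {"composer": "-", "title": "Second work"},
]

if __name__ == "__main__":
    for card in fill_composers(SAMPLE_CARDS):
        print(card_to_marcxml(card))
```

The sketch raises an error if a placeholder appears before any name has been seen, because a wrongly guessed composer would be worse than a card flagged for manual review.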

Regarding the second challenge, at present data is entered into RISM-Online via the Muscat interface. Contributors gain access by contacting RISM-Online and being granted a username and password, which ensures quality control and security for the database. Our task was to explore different technical approaches to crowdsourcing data in ways that would not compromise the integrity and accuracy of the database. Previous approaches have used spreadsheets submitted by individuals without access to Muscat. This has thus far proven problematic, as the data must be encoded and structured perfectly for the spreadsheet to import correctly. We decided that the best approach to explore was a simplified, publicly accessible version of Muscat (MiniMuscat). We then drafted specifications for the MiniMuscat templates based on who we believed would be our target users (e.g., early music and/or dance enthusiasts, professional early music performers, and archivists with an interest in music history). We reduced the number of data fields to those which we anticipated users would be able to populate accurately and consistently, and which Muscat contributors could proofread and edit efficiently. We then mocked up a MiniMuscat interface with accompanying user guidelines, and tested it with one of the STSM 3 grantees who had not yet used Muscat. The test was mostly successful; we identified and made some necessary changes and clarifications to the tutorial text.
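The fragility of spreadsheet imports, and the appeal of a reduced field set, can be illustrated with a short Python sketch that checks a CSV export before anything is imported. This is only an illustration: the three field names are hypothetical stand-ins, not the MiniMuscat specification we drafted, and the function is not part of Muscat.

```python
import csv
import io

# Hypothetical reduced field set; the real MiniMuscat templates may differ.
REQUIRED_FIELDS = ("composer", "title", "library_siglum")


def validate_rows(csv_text):
    """Split spreadsheet rows into importable rows and (row number, message) errors."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing_columns = [field for field in REQUIRED_FIELDS if field not in header]
    if missing_columns:
        # Without the expected columns, no row can be interpreted safely.
        return [], [(1, "missing columns: " + ", ".join(missing_columns))]

    valid, errors = [], []
    # Row 1 is the header, so data rows are numbered from 2 as in a spreadsheet.
    for number, row in enumerate(reader, start=2):
        empty = [f for f in REQUIRED_FIELDS if not (row[f] or "").strip()]
        if empty:
            errors.append((number, "empty fields: " + ", ".join(empty)))
        else:
            valid.append({f: row[f].strip() for f in REQUIRED_FIELDS})
    return valid, errors


if __name__ == "__main__":
    sample = "composer,title,library_siglum\nExample,Work,X-Yz\n,Untitled,X-Yz\n"
    rows, problems = validate_rows(sample)
    print(rows)
    print(problems)
```

Reporting problems by row number, rather than rejecting the whole file, is one way to give contributors feedback they can act on while keeping incomplete records out of the database.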

There remain many challenges to developing MiniMuscat, principally whether it would provide an adequate return on the investment of time and resources. However, it is clear that there are workable solutions for drawing catalogue data from a wider range of contributors while ensuring that RISM-Online continues to provide high-quality, reliable information for its users. Developing MiniMuscat would also be a useful way of upskilling librarians and archivists, something that has been highlighted as an important endeavour by organisations like the International Association of Music Libraries, Archives and Documentation Centres (IAML).

In addition to our experiments, the grantees in attendance presented their projects and their different approaches to gathering data, creating and curating databases, and encoding metadata. Andrew Hankinson also presented new tools being developed at the RISM Digital Center. The week was an excellent opportunity to exchange knowledge, test the practical abilities of new technologies, and see how these technologies fit with long-standing processes developed at the RISM Digital Center.

