Development of an AI Metadata Extraction Model to Enhance Electronic Resources Indexing in Academic Libraries
Asifiwe J. Makawa¹, Paulina N. Kayungi¹, Daudi H. Danda¹, Mboni A. Ruzegea², William J. Mviombo³ and Neema F. Mosha⁴
1 Directorate of Library Services, Dar es Salaam University College of Education
2 Tanzania Library Services Board
3 Directorate of Library Services, Muhimbili University of Health and Allied Sciences
4 Directorate of Library Services, Nelson Mandela African Institution of Science and Technology
Email:
Abstract
This study developed an AI model that uses natural language processing (NLP) techniques to extract metadata from electronic resources and improve indexing in academic libraries, and it assessed the model's performance. Data were collected from three selected higher learning institutions, namely DUCE, MUHAS, and NM-AIST. These institutions use KOHA, an open-source integrated library system that supports library management and bibliometric analysis. Bibliographic metadata for 8,421 records covering the period from 2010 to 2022 were extracted from these institutions. Of these records, 79% were books, 12% were open access journal articles and other online resources, and 9% were dissertations and theses. An ensemble learning model was developed that combined k-means clustering with NLP. Features used in clustering included ISBN, barcode number, publication year, authors, publishers, titles, keywords, location, and call number. A Sentiment Analyzer (SA) was used to extract sentiments from online articles: it detected all references to a given subject and determined the sentiment of each reference using NLP techniques. The ensemble served as a meta-learning approach that leveraged the strengths of both models to build a more robust and accurate metadata extraction model. The model obtained an F1 score of 0.72. The F1 score is an evaluation metric, widely used in natural language processing, that combines two metrics, precision and recall, into a single value by taking their harmonic mean. The results indicate that the ensemble model improves the accuracy of extracting indexed bibliographic resources in digital libraries using relevant search queries, and that it extracts relevant indexed electronic resources with good precision.
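As an illustration of how the F1 score relates to precision and recall, the following is a minimal Python sketch. The function names and the example counts are hypothetical and are not taken from the study; the sketch only shows how an F1 value such as 0.72 can arise as the harmonic mean of the two metrics.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from retrieval counts.

    tp: relevant records retrieved, fp: irrelevant records retrieved,
    fn: relevant records missed. Returns 0.0 for an undefined ratio.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen for illustration only (not data from the study).
p, r = precision_recall(tp=72, fp=28, fn=28)
f1 = f1_score(p, r)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by maximizing precision or recall alone.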
Keywords: Metadata extraction, artificial intelligence, NLP, OPAC, machine learning, digital libraries.
Proceedings of the 6th COTUL Scientific Conference, 11–12 November 2024 at TMDA, Mwanza, Tanzania
P.O. Box 4302,
Ali Hassan Mwinyi Road, Kijitonyama
(Sayansi) COSTECH Building,
Dar es Salaam, Tanzania
E-mail : chairperson@cotul.or.tz
Mobile : +255 757 547 856
Mobile : +255 784 315 281
Phone : +255 734 680 978