Researcher Oksana Dereza is using AI and machine learning to make the study of ancient and minority languages easier.
Data scientist and PhD researcher Oksana Dereza is developing AI tools to help scholars without technical expertise research ancient and minority languages more efficiently and effectively.
For Dereza, who is based at the Insight SFI Centre for Data Analytics in Galway, interdisciplinary work in computer science and linguistics means double the challenges but also double the fun. It means she is never bored, and that is what attracted her to research in the first place.
Keen on making research more accessible, Dereza sees the move towards hybrid teaching and learning events since the Covid-19 pandemic as a real positive. “I think it’s great,” she said.
“People who live outside of big cities or have mobility issues can now engage with scholars from all over the world from their living rooms!”
Tell us about your current research.
For the past four years, I’ve been a part of the Cardamom (Comparative Deep Models for Minority and Historical Languages) project at the Data Science Institute at the University of Galway.
As the name suggests, we’ve been working on bringing together cutting-edge deep-learning methods and under-resourced languages. Natural language processing (NLP) is very Anglo-centric, and minority languages remain poorly covered in terms of available models, tools and datasets.
Our team has been trying to bridge this gap, with a particular focus on Irish and the languages of India. Recently, we released a workbench that makes the models and technologies we’re developing easily accessible to researchers who may not have much technical knowledge. It is an annotation tool that provides tokenisation, part-of-speech (POS) tagging, similar-word search and other NLP instruments for minority and historical languages.
As one of the two PhD students involved in the project, I’ve been focusing on the historical side of things. I deal with word embeddings, a kind of vector-space representation of words or subword units.
Training a good embedding model that reflects the semantic and syntactic relationships within a language requires copious amounts of data, which we don’t have for most ancient and historical languages. Moreover, historical data shows a much higher level of variation than modern language data, which makes it especially hard for machine learning (ML) algorithms to discover patterns in it.
I’ve been exploring the possibilities of transfer learning between different stages of the same language, as well as trying to find embedding evaluation techniques most relevant to historical linguistics. As I came to computer science from Celtic studies, I mostly experiment with Old, Middle and Early Modern Irish texts, but at the end of the day, my work is applicable to any language.
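The transfer-learning idea described above can be illustrated with a minimal sketch. This is not the Cardamom code, and every word form and vector below is an invented placeholder: it simply shows the principle of seeding an embedding model for a low-resourced historical stage with vectors trained on a better-resourced stage, for whatever vocabulary the two stages share.

```python
# Minimal sketch (illustrative only): initialise embeddings for a
# low-resourced target stage of a language from a source-stage model,
# reusing vectors for shared word forms and randomising the rest.
import random

DIM = 4  # toy embedding dimensionality


def random_vector(dim: int) -> list[float]:
    """Random initialisation for words unseen in the source stage."""
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]


# Pretend these vectors were trained on a larger later-stage corpus.
source_stage = {
    "fer": [0.9, 0.1, 0.0, 0.2],
    "ben": [0.8, 0.2, 0.1, 0.3],
    "rí": [0.1, 0.9, 0.7, 0.0],
}

# Vocabulary extracted from a much smaller earlier-stage corpus.
# Note "ríg" vs "rí": spelling variation means the forms don't match,
# which is exactly the kind of problem historical data poses.
target_vocab = ["fer", "ben", "ríg", "ech"]


def init_from_source(vocab, source, dim=DIM):
    """Initialise target-stage vectors, copying source vectors where the
    word form is shared and falling back to random vectors otherwise."""
    return {
        word: list(source[word]) if word in source else random_vector(dim)
        for word in vocab
    }


target_stage = init_from_source(target_vocab, source_stage)
shared = [w for w in target_vocab if w in source_stage]
print(f"Transferred {len(shared)} of {len(target_vocab)} vectors: {shared}")
```

In a real setting the copied vectors would then be fine-tuned on the historical corpus rather than used as-is; the sketch only covers the initialisation step.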
In your opinion, why is your research important?
My research mostly caters to scholars in the humanities: linguists, philologists, historians, anthropologists etc. I work on models and tools that make it possible to analyse ancient texts with modern digital methods regardless of one’s level of technical expertise. It also saves scholars considerable time by automating routine tasks. For example, to find words that occur in similar contexts, you used to have to read through many texts, pick examples by hand, then make your calculations and conclusions. Now, you can just send a query to an embedding model trained on the corpus of texts that you previously had to go through manually.
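The similar-word query described above boils down to a nearest-neighbour search in embedding space. Here is a minimal sketch using cosine similarity; the vector table is made up for illustration, whereas in practice the vectors would come from a model trained on the corpus.

```python
# Minimal sketch (illustrative only): rank the vocabulary by cosine
# similarity to a query word's embedding vector.
import math

# Toy embedding table; real vectors come from a trained model.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "horse": [0.1, 0.2, 0.9],
    "sword": [0.2, 0.1, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def most_similar(word, table, top_n=2):
    """Return the top_n words closest to `word` in embedding space."""
    query = table[word]
    scores = [(other, cosine(query, vec))
              for other, vec in table.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


print(most_similar("king", embeddings))
```

With these toy vectors, "queen" comes out as the closest neighbour of "king", which is the kind of result a scholar would previously have had to assemble by reading through the texts by hand.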
With a growing interest in cultural heritage among the public, making historical text sources openly accessible, searchable and explorable is highly important, and my research contributes to the ‘searchable’ and ‘explorable’ parts. Imagine you’d like to find out how a specific Irish word was used a century ago. It is already possible with a couple of clicks thanks to the Historical Irish Corpus, but what if you could trace this word’s history all the way back to Old Irish, observing it in context, witnessing the changes in spelling and grammar, and learning which words were most similar to your query at different stages?
What inspired you to become a researcher?
It was J R R Tolkien who put this romantic image of a scholar studying ancient manuscripts and deciphering obscure texts in my head. I first read his essays and letters when I was around 16, I think. However, I’ve always been curious and easily bored. As a child, I was constantly posing questions and trying to find answers, delving into a different topic every other day just to keep myself entertained. Then I learned that it’s more or less what researchers do!
What are some of the biggest challenges or misconceptions you face as a researcher in your field?
As in any interdisciplinary field, NLP researchers face double the number of misconceptions compared to linguists or computer scientists. Working with non-mainstream languages and historical data makes it even more challenging. On the one hand, you have to prove to computer scientists that ancient and historical languages require more linguistic awareness, and you can’t just blindly apply something that worked for a dozen modern languages to Old Irish. On the other hand, you have to persuade historical linguists that what you’re doing doesn’t contradict traditional humanities research methods, and that it is aimed at helping scholars, not replacing them.