‘Lots of good data is the holy grail, but lots of bad data is very dangerous’


5 Jun 2019

Claire Gormley. Image: UCD

Claire Gormley of the UCD School of Mathematics and Statistics is on a mission to show that data is more than just a modern buzzword.

After completing her bachelor’s degree in mathematics at Trinity College Dublin, Claire Gormley undertook a visiting scholarship at the University of Washington in 2005.

She returned to Ireland in 2006 as an assistant professor at University College Dublin (UCD) and since 2017 has been an associate professor at the university's School of Mathematics and Statistics.

What inspired you to become a researcher?

As is typical, my mathematics education was very examination-based. However, in my final undergraduate year I had the option to do a research project in statistics. I signed up and that decision changed my future. Up to then, I thought I wanted to work in one of the shiny financial offices in the centre of Dublin or London. About a week into the project, I realised I wanted to be a research scientist.

The project involved analysing Irish third-level colleges' applications data, searching for subgroups of applicants (known as clusters) to try to answer the question of whether or not the so-called 'points race' was real. The interesting part for me was the realisation that analysing the data was not straightforward; the data are lists of courses, with different numbers of courses per applicant.

Can you tell us about the research you’re currently working on?

I’m a methodological statistician, with an applied focus. My interest is in the development of statistical methods (or models) and of ways to estimate them, where the need for such methods is motivated by a real applied problem.

A current example – with my PhD student Keefe Murphy, Prof Brendan Murphy and Prof Raffaella Piccarreta of Bocconi University, Milan – is the development of a statistical method to analyse sequence data recording the education and career trajectories of people from Northern Ireland.

Another research area – in collaboration with postdoctoral fellow Dr Silvia D’Angelo and Prof Lorraine Brennan of UCD – is the estimation of true dietary intake from self-report dietary data.

Brennan’s group develops metabolomic biomarkers as more objective measures of food intake. Silvia and I are developing a statistical model that uncovers the relationship between the self-report dietary data and the biomarkers.
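As a hedged illustration only: the D'Angelo–Gormley model itself is not described here, but classical regression calibration is one standard, generic way to relate error-prone self-reports to a more objective biomarker. The short R sketch below simulates invented intake data purely to show the idea; none of the numbers or variable names come from the research.

```r
## A toy example (not the D'Angelo-Gormley model): classical regression
## calibration, a standard way to relate error-prone self-reports to a
## biomarker. All data below are simulated for illustration.
set.seed(7)
n    <- 500
true <- rnorm(n, mean = 70, sd = 15)    # true daily intake (g), unobserved
self <- 0.8 * true + rnorm(n, sd = 12)  # self-report: biased and noisy
bio  <- true + rnorm(n, sd = 5)         # biomarker: a less biased proxy

## Calibrate: predict the biomarker (standing in for true intake)
## from the cheap, noisy self-report measurement.
calib <- lm(bio ~ self)
summary(calib)$coefficients

## Calibrated intake estimates for the cohort
intake_hat <- predict(calib)
```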

I am also co-directing the recently announced Science Foundation Ireland (SFI) Centre for Research Training (CRT) in foundations of data science. This €21m CRT – funded by SFI with University of Limerick, Maynooth University, industry partners and Skillnet Ireland – will provide a world-class environment, training 139 PhD students in applied mathematics, statistics and machine learning, with application areas of national importance.

In your opinion, why is your research important?

So many fields, from research to social policy, health policy and personal and business decisions, rely on statistical methods to analyse collected data and support decision-making.

From deciding on whether or not you should be approved for a loan to deciding on the probability of you having a disease, statistical methods are an often latent ingredient in our everyday lives. And, contrary to current commentary that suggests such things are new, statistics has been around for hundreds of years.

The Romans collected and explored data. Data may now be a buzzword, but it has been statisticians’ bread and butter for decades and – perhaps unbeknownst to many – has been a silent part of everyday life for many years.

Without research that develops statistical methods that are apposite and reproducible, and that help inform decisions by quantifying the uncertainty inherent in the data, no modern society can function or progress.

What commercial applications do you foresee for your research?

So many! Being a data scientist is the thing to be these days. Every government department, business or state body that employs data scientists is essentially commercialising statistical methods. From my own research, the work I have done on developing clustering methods for data of mixed type has potential commercial applications in fields ranging from customer segmentation to disease diagnosis.

Statistics is a very collegial science; we make many of the methods we develop freely available for use by others through open-source software, most often the R statistical computing environment. My students and I have contributed several such R packages.
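For a flavour of what mixed-type clustering looks like in practice, here is a minimal R sketch. It uses the general-purpose Gower-dissimilarity-plus-PAM approach from the cluster package that ships with R, not Gormley's own model-based methods, and the customer data frame is invented for illustration.

```r
## A minimal sketch of clustering mixed-type data in R, using the
## 'cluster' package (shipped with R). This is a generic approach,
## not Gormley's model-based methods; the data are simulated.
library(cluster)

set.seed(42)
n <- 100
customers <- data.frame(
  age     = round(rnorm(n, mean = 40, sd = 12)),  # numeric
  spend   = rlnorm(n, meanlog = 5, sdlog = 0.6),  # numeric
  region  = factor(sample(c("north", "south", "east", "west"),
                          n, replace = TRUE)),    # categorical
  loyalty = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                   levels = c("low", "mid", "high"),
                   ordered = TRUE)                # ordinal
)

## Gower's coefficient puts numeric, categorical and ordinal
## variables on a common [0, 1] dissimilarity scale.
d <- daisy(customers, metric = "gower")

## Partitioning around medoids: a k-medoids analogue of k-means
## that works directly on a dissimilarity matrix.
fit <- pam(d, k = 3)
table(fit$clustering)  # cluster sizes
```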

What are some of the biggest challenges you face as a researcher in your field?

Keeping up with the pace of advancements in statistics. It’s a vibrant field with really excellent researchers.

Also, keeping the world ‘data humble’ in the face of the data deluge. Lots of data is not necessarily good. Lots of good data is the holy grail, but lots of bad data is very dangerous. Very often what we have is poor-quality data from which we need to derive a decision. It’s in such cases that we need to be humble and honest about the inherent uncertainty in the final decision, and this is where the science of statistics strongly contributes. Dampening expectations in the data era, without being viewed as curmudgeons, is a challenge.

Are there any common misconceptions about this area of research?

Often it is assumed that statistical researchers are essentially statistical consultants. Statistics is a mathematics-based discipline, with computer programming the language that we use to turn the mathematics into usable methods that can be applied to analyse data.

Our interest as statistical researchers is in the development of statistical methods, most typically driven by a real application or dataset, rather than in simply implementing off-the-shelf methods that are already widely available.

I often get amazed reactions when people realise there are peer-reviewed statistics journals, and it is publications in such venues that are valued in the statistics community. Another misconception is one my brother often kindly berates me with: “Have you still not figured out how to compute percentages?”

What are some of the areas of research you’d like to see tackled in the years ahead?

Much of my own research focuses on developing statistical models for data of mixed type, such as when the data recorded on subjects combine numeric measurements, categorical features and so on.

Such ‘multimodal’ data are becoming more and more common as data become cheaper to collect, yet there is a dearth of principled statistical methodologies for their apposite analysis. Statistical computation is also a ripe area. As data sizes grow, can statistical methods evolve accordingly?
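As a toy illustration of methods evolving with data size: a single-pass mean and variance update in the style of Welford's algorithm processes data in chunks without ever holding the full dataset in memory. This is a generic sketch, not a method from Gormley's research, and the chunked stream below is simulated.

```r
## A toy illustration of one way statistical computation can scale:
## Welford-style single-pass updates of a running mean and variance.
## A real application would read successive chunks from disk or a
## database; here the stream is simulated.
welford_update <- function(state, x) {
  for (xi in x) {
    state$n    <- state$n + 1
    delta      <- xi - state$mean
    state$mean <- state$mean + delta / state$n
    state$m2   <- state$m2 + delta * (xi - state$mean)
  }
  state
}

state <- list(n = 0, mean = 0, m2 = 0)
set.seed(1)
for (chunk in 1:50) {  # 50 chunks of 10,000 values each
  state <- welford_update(state, rnorm(1e4, mean = 3, sd = 2))
}
c(mean = state$mean, var = state$m2 / (state$n - 1))  # approx. 3 and 4
```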

Are you a researcher with an interesting project to share? Let us know by emailing editorial@siliconrepublic.com with the subject line ‘Science Uncovered’.