Consumer genomics databases, which were instrumental in catching the Golden State Killer, could be used to identify almost anyone in the future, according to the latest research.
When notorious serial rapist and murderer Joseph James DeAngelo, also known as the Golden State Killer, was apprehended by using genealogy databases after years of eluding authorities, the question of genetic privacy was sidelined. People wanted to celebrate the fact that cutting-edge technology had been successfully leveraged to catch a man who had terrorised the state of California for decades.
Yet research published yesterday (11 October) in Science has produced results that should give people pause. The method used to reveal DeAngelo’s identity could some day make it theoretically possible to identify any person by tracing their distant relatives.
These relatives would be so distant (third cousins or further out) that they would likely be unknown to the person being identified. In other words, they would be total strangers. It could be possible to do this regardless of whether the person in question had submitted their own DNA.
A new frontier for forensic identification
The use of DNA in forensic investigation became popular in the 1990s. The method relies on matching pieces of DNA discovered in samples found at crime scenes, such as blood or hair, to victims or perpetrators. It can only identify the person themselves or a close relative, and the samples can only be checked against heavily regulated databases.
The method used to home in on the Golden State Killer is different for a few reasons. It took DNA collected from one of his crime scenes and checked it against publicly available data from a variety of consumer genetic testing websites such as GEDmatch and MyHeritage. People submit their DNA to these services to gain more insight into their ancestral identity or to be connected to long-lost family members.
Using the findings from this process, investigators were able to confirm that DeAngelo fit the correct age profile and lived in an area near where the crimes occurred. Once they had corroborated their findings against DNA that DeAngelo left on a car door handle while out shopping, they made the arrest.
Inspired by this specific instance, Columbia University computer engineer Yaniv Erlich and his team decided to investigate how easy it would be to use this method to discover someone’s identity. They examined anonymised data from 1.2m people who had been tested through MyHeritage, a site at which Erlich is chief science officer. The team excluded anyone who had immediate family members also in the database, so that any identification would rely solely on the DNA of someone who was effectively a stranger.
They found that more than half of these people could be spotted via distant relatives. The hit rate was closer to 60pc for people of European descent, who made up 75pc of the sample. For about 15pc of the sample, the team was able to find a second cousin.
The team was then able to establish someone’s identity using both these relatives and demographic identifiers such as possible age and possible state of residence. Furthermore, the research concluded that once 2pc of the population submit data, it will be possible to identify almost anyone in the US.
The genetic privacy implications
Erlich has mixed feelings about the findings, as he explained to Gizmodo. “Of course, there’s some good news. If someone did something wrong out there, then [law enforcement] is going to be able to catch them.” Since April of this year, at least 13 criminal cases have “seemingly” been solved with the help of genealogy services, Gizmodo claimed.
Yet anyone familiar with the thorny ethical issues surrounding the intersection of privacy and safety will be able to predict Erlich’s line of thinking: this tool could be used for nefarious purposes just as easily. Researchers raised concerns that this method, if left unregulated, could be exploited by companies and individuals who may try to sell the information elsewhere.
Erlich’s research proposes measures to mitigate the risks associated with using genomic data in this way. It suggests that direct-to-consumer (DTC) providers such as MyHeritage and 23andMe should cryptographically sign the text file containing the raw data available to consumers. This would enable third-party services to authenticate a raw genotyping file, ensuring it was created by a valid DTC provider and not subsequently modified. In practice, a third-party service would then only run searches through its database after confirming that the query came from a customer.
The researchers hope that their proposal, if adopted, could prevent the exploitation of long-range familial searches from DNA evidence. They have provided demo source code on GitHub to sign and verify the raw genotype files.
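To make the sign-and-verify idea concrete, here is a minimal sketch in Python. It is not the researchers’ GitHub demo, and it makes no claim about how any DTC provider works. It assumes a Lamport one-time signature, chosen only because it can be built from Python’s standard library. A real provider would more likely use a standard scheme such as Ed25519 or RSA, which can sign many files with one key. The function names and the sample genotype line are made up for illustration.

```python
import hashlib
import secrets

HASH_BITS = 256  # SHA-256 produces a 256-bit digest


def _h(data: bytes) -> bytes:
    """SHA-256 helper."""
    return hashlib.sha256(data).digest()


def generate_keypair():
    """Create a (private_key, public_key) pair that is safe for ONE signature.

    Private key: 256 pairs of random 32-byte secrets.
    Public key: the SHA-256 hash of each secret.
    """
    private_key = [
        (secrets.token_bytes(32), secrets.token_bytes(32))
        for _ in range(HASH_BITS)
    ]
    public_key = [(_h(zero), _h(one)) for zero, one in private_key]
    return private_key, public_key


def _digest_bits(message: bytes):
    """Return the 256 bits of the message's SHA-256 digest as a list of 0/1."""
    digest = int.from_bytes(_h(message), "big")
    return [(digest >> i) & 1 for i in range(HASH_BITS)]


def sign(private_key, message: bytes):
    """Provider side: reveal one secret per digest bit."""
    return [private_key[i][bit] for i, bit in enumerate(_digest_bits(message))]


def verify(public_key, message: bytes, signature) -> bool:
    """Third-party side: check each revealed secret against the public key."""
    if len(signature) != HASH_BITS:
        return False
    bits = _digest_bits(message)
    return all(
        _h(part) == public_key[i][bit]
        for i, (bit, part) in enumerate(zip(bits, signature))
    )


# Hypothetical raw genotype file contents (placeholder marker, not real data).
raw_file = b"# rsid\tchromosome\tposition\tgenotype\nrs0000001\t1\t12345\tAG\n"

private_key, public_key = generate_keypair()   # held by the DTC provider
signature = sign(private_key, raw_file)        # shipped alongside the file

print(verify(public_key, raw_file, signature))   # True: file is untouched

tampered = raw_file.replace(b"AG", b"GG")        # simulate an edited file
print(verify(public_key, tampered, signature))   # False: signature no longer matches
```

Only the provider holds the private key, while anyone with the published public key can run the check. That is what would let a genealogy service reject files that were hand-built or altered before upload.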
As home DNA tests become more popular, the implications of using this data for other purposes will only become more complex and, arguably, more worrying.
In July of this year, a controversial partnership was created between pharma giant GlaxoSmithKline and 23andMe in a bid to use genetic data to help them develop new medicines. Privacy experts raised concerns about GDPR implications and about placing public health data into the hands of for-profit companies.