Scientists say using math to sort through DNA could help investigators put stubborn cold cases to rest. The approach combines the relatively new field of forensic genetic genealogy – solving crime by charting out DNA-based family trees – with increasing computational power to speed up and simplify this complex form of investigation.
In a paper recently published in the Journal of Forensic Sciences, researchers from Stanford University, California-based Identifinders, and the DNA Doe Project explain how they developed a new mathematical model to help investigators greatly narrow down their giant pools of genetic candidates:
“We formulate a program that – given the list of matches and their genetic distances to the unknown target – chooses the best decision at each point in time: which match to investigate, which set of potential most recent common ancestors to descend from, or whether to terminate the investigation.”
By using a decision tree to optimize the candidate search, the researchers say their new process improves on the existing approach to forensic genetic genealogy by a factor of 10. They can also use this protocol to pull relevant matches even from large pools with a low likelihood of success. In fact, the new algorithm is so effective that researchers say it “can solve a case with a 7,500-person family tree around 94% of the time,” compared to only 4% of the time with the current method, according to a Stanford University press release.
Basically, it’s a great way to speed up and enrich the research investigators are already doing – like turning your regular bicycle into an e-bike.
Genetic Genealogy Takes on Crime
Genetic genealogy is the term for combining DNA testing with traditional genealogy to create family trees on a genetic basis – think at-home genetic testing like 23andMe combined with Ancestry.com (which now offers its own DNA testing). It’s also used to test unknown exhumed remains against modern descendants. Genetic genealogy becomes forensic when it’s applied to solving a crime.
The applications for this genetic information are easy enough to see. If an unknown deceased person is found or DNA from a criminal suspect can’t be identified via traditional means, police may take that genetic information and then cross-check it against other data – like what’s known about missing persons at the time. When direct genetic information isn’t available, they can ask close relatives and look for the percentage of shared DNA that indicates a family relationship. On average, a person shares roughly 25% of their DNA with a grandparent, 12.5% with a first cousin, and 3.13% with a second cousin.
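Those averages follow from a simple rule of thumb: each parent-child link halves the expected share, and the contributions of every shared ancestor add up. The sketch below is only an illustration of that arithmetic; the function name and its parameters are made up for this example, and real shared-DNA amounts vary around these averages.

```python
from fractions import Fraction


def expected_shared_fraction(links_up: int, links_down: int, shared_ancestors: int = 1) -> Fraction:
    """Average fraction of DNA two relatives are expected to share.

    links_up / links_down: parent-child links from each person to the shared ancestor(s).
    shared_ancestors: 1 for a direct ancestor, 2 for full cousins (an ancestral couple).
    Hypothetical helper for illustration; it ignores the natural variation in inheritance.
    """
    total_links = links_up + links_down
    return shared_ancestors * Fraction(1, 2) ** total_links


grandparent = expected_shared_fraction(2, 0)        # 1/4
first_cousin = expected_shared_fraction(2, 2, 2)    # 1/8
second_cousin = expected_shared_fraction(3, 3, 2)   # 1/32

for label, share in [("grandparent", grandparent),
                     ("first cousin", first_cousin),
                     ("second cousin", second_cousin)]:
    print(f"{label}: {share} of DNA on average")
```

The second-cousin value of 1/32 is 3.125%, which is the figure rounded to 3.13% above.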
When millions of people began buying and submitting at-home genetic testing kits, that information was largely made available to law enforcement, despite ongoing questions of legality. That means police now have access to a much larger DNA pool, which they can use to find matches for unidentified victims or suspects of violent crime.
In 2018, investigators used forensic genetic genealogy to crack open a major case for the very first time: capturing the Golden State Killer. In that case, one man – himself a former police officer – committed at least 13 murders, 51 rapes, and dozens of burglaries and other crimes in California throughout the 1970s and 1980s. Because of the variety of crimes and wide geographical area, investigators only consolidated all three major streaks into one file they named the “Golden State Killer” in 2013, decades after the crimes ended in 1986. Police combined DNA databases and made many different family trees, ranging as far back as the 1800s, then narrowed down the suspects to just one.
Like Finding a Needle in a Haystack
So far, Stanford University reports, forensic genetic genealogy has been used to solve over 400 crimes. But the process is tedious, and it’s mostly been undertaken by individuals committed to seeing a case through. And you might be thinking, correctly, that the process is ripe for the application of some raw computing power. Isn’t genetic information just a big list or database, ready to search?
That’s not exactly wrong, but it’s not the whole story. Genetics are messy and enormous. Family relationships get a lot less noticeable and identifiable very quickly as you move away from the immediate family group.
The researchers used data from 17 actual cases to test their model. In each case, the target’s DNA – that of the suspect or the victim – produced anywhere from 200 to 5,000 matches.
“It is not obvious how many matches, and which of these matches, to investigate, nor is it obvious how to optimally look for an intersection among their families,” the authors write in the study. And so, while we have more computing power than ever before, investigators still need help structuring their searches. This is where the decision-making math comes in.
A decision tree is kind of like a game of Guess Who? In this iconic children’s game, a full docket of people share certain traits like hair and eye color, glasses, or facial hair.
Players ask each other eliminating questions – Is your person blond? Do they have brown eyes? – then flip down the candidates they’ve eliminated. But instead of following visible genetic traits, the algorithm looks at the underlying genomes of the matches and their possible relationship with the target. At each juncture, the researchers’ model makes a decision on which lead to pursue.
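To make the Guess Who? analogy concrete, here is a toy sketch of elimination by yes/no questions. It illustrates the game only, not the researchers’ model; the names, traits, and the `eliminate` helper are invented for this example.

```python
# Toy candidate board; every name and trait here is made up.
candidates = {
    "Alex": {"hair": "blond", "glasses": True},
    "Bea": {"hair": "brown", "glasses": False},
    "Cal": {"hair": "blond", "glasses": False},
    "Dee": {"hair": "black", "glasses": True},
}


def eliminate(pool, trait, value, answer_is_yes):
    """Keep only candidates consistent with the answer to 'Is <trait> equal to <value>?'."""
    return {
        name: traits
        for name, traits in pool.items()
        if (traits[trait] == value) == answer_is_yes
    }


# "Is your person blond?" -> yes: Alex and Cal remain.
pool = eliminate(candidates, "hair", "blond", True)
# "Do they wear glasses?" -> no: only Cal remains.
pool = eliminate(pool, "glasses", True, False)
print(list(pool))
```

Each answer shrinks the pool, which is the basic idea a decision tree formalizes: pick the next question based on what the previous answers left standing.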
A More Efficient, Mathematical Approach
The researchers refer to their new approach as the “proposed strategy” and to the current method as the “benchmark strategy.”
“The benchmark method looks for common ancestors between different matches. What you really want to find is the most recent common ancestor between a match and the unknown target, and that’s a slightly different problem,” Lawrence Wein, one of the study authors and a professor of operations, information, and technology at Stanford University, says in the release. According to the researchers, their proposed method is far more efficient because it significantly reduces the overall workload and number of dead-end leads.
As for the math used to help parse through all the genetic data, the researchers created a two-part algorithm that is a kind of stochastic dynamic program, which they define as “the standard approach to solving multi-period optimization problems under uncertainty.”
At every step, the program uses probability while prioritizing the most cost-effective matches. In part, it does this by using the Autocluster tool from GEDmatch, which groups “DNA matches of people who have a common ancestor and likely belong to the same branch of the family tree,” according to the company. The algorithm also uses “probabilistic information about the relationship between the target and the match,” according to the study. (The algorithm allows for quite a bit of leeway, too, and even matches with little probability of success are explored.) Meanwhile, the current benchmark method uses neither of those, and requires manual legwork from investigators to determine which DNA-match leads to pursue.
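As a rough illustration of what prioritizing by cost-effectiveness can look like, the sketch below ranks leads by an assumed ratio of success probability to expected effort. The `Lead` class, its fields, the ratio itself, and all numbers are hypothetical; this is not the scoring used in the study.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    name: str
    success_probability: float  # assumed chance that pursuing this lead finds the target
    expected_cost: float        # assumed effort, e.g. hours of genealogy work

    @property
    def cost_effectiveness(self) -> float:
        # Illustrative score only: probability of success per unit of effort.
        return self.success_probability / self.expected_cost


def rank_leads(leads):
    """Return leads ordered from most to least cost-effective."""
    return sorted(leads, key=lambda lead: lead.cost_effectiveness, reverse=True)


leads = [
    Lead("match_a", 0.30, 40.0),  # likely, but expensive to trace
    Lead("match_b", 0.10, 5.0),   # less likely, but cheap to check
    Lead("match_c", 0.05, 20.0),
]
ranked = rank_leads(leads)
print([lead.name for lead in ranked])
```

Under these made-up numbers the cheap, moderately likely lead comes first, which shows why the most probable match is not always the best one to investigate next.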
The first step looks at each “generation-ancestral couple pair” – identified matches that had offspring together – and assesses the probability of finding the target by working downward from that pair. If a pair’s cost-effectiveness value passes the threshold the researchers set, then it’s worth investing time into looking at the descendants of that pair in hopes of identifying the target.
In the second step, if a matched pair doesn’t meet the threshold value – if the algorithm deems it improbable that the target is their direct descendant – the algorithm will then work upward in the family tree from that point, then downward again if it finds any promising candidates until the most recent common ancestor(s) of the unknown target is found.
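The descend-or-ascend logic in these two steps can be sketched as a simple loop. This is a simplified reading of the description above, not the study’s algorithm: the toy tree, the scores, the threshold values, and the function name are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical toy tree: each ancestral couple maps to the couple one generation up.
PARENT_COUPLE = {"couple_g1": "couple_g2", "couple_g2": "couple_g3", "couple_g3": None}

# Made-up cost-effectiveness scores for descending from each couple.
SCORE = {"couple_g1": 0.02, "couple_g2": 0.04, "couple_g3": 0.15}


def choose_couple_to_descend(start: str, threshold: float) -> Optional[str]:
    """Walk upward from `start` until a couple clears the threshold.

    Returns the couple worth descending from, or None if no couple qualifies.
    """
    couple: Optional[str] = start
    while couple is not None:
        if SCORE[couple] >= threshold:
            return couple               # step one: worth searching this couple's descendants
        couple = PARENT_COUPLE[couple]  # step two: otherwise move up a generation
    return None


print(choose_couple_to_descend("couple_g1", threshold=0.10))
```

With a lower threshold the search would descend from the first couple right away, and with a very high one it would run out of ancestors, which in this toy version corresponds to ending the search.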
The Future of Solving Crimes?
So what the researchers from Stanford and elsewhere have done is use mathematical inference to get a huge head start on the game by calculating how likely each candidate is based on the information at hand. They describe it as a kind of “roadmap” for investigators to follow. That means the questions investigators ask after that can be smarter, more specific, and more impactful in their investigations.
Still, the researchers point out that their method can’t fully replace the work done by genealogists, who may use more case-specific information, like location, in their search. And investigators still have to put in the time to solve the case and attain justice.
But the results of the study certainly speak for themselves. With a model that’s purportedly ten times better than what we have now, that list of 400 solved cases could soon grow by quite a bit – and very quickly.
Additional reporting by Jessica Coulon.