University of Toronto researchers have developed a new computational method to study how chromosomes within human cells are organized.
The method could help advance precision medicine by shedding light on how chromosomal re-organization leads to disease development.
Professor Samin Aref (MIE) worked with a team led by Professor Philipp Maass from the Temerty Faculty of Medicine’s Department of Molecular Genetics and the SickKids Research Institute. Together, they looked at human chromosomes, which are long strands of DNA that contain genetic information.
Changes or abnormalities in the structure or arrangement of chromosomes have been linked to various diseases, including cancer. But until now, there has been limited understanding of how chromosomes are organized within the cell nucleus. The team examined interactions between regions on different chromosomes to form hypotheses about how chromosomes are organized, then validated those hypotheses with computational and imaging experiments.
In a paper published in Nature Communications, the researchers introduce a new computational method known as Signature. This method, which relies on machine learning, has uncovered previously unknown patterns in the genome, the complete set of genetic material in a human.
Imaging techniques, such as microscopy, can visualize chromosomal territories and where they overlap. But with imaging, researchers can only study one cell at a time, which is slow and laborious.
“You can label chromosomes, and you can use microscopy to visualize how they organize into chromosomal territories — so imagine one is green, and the other one is red, and where they are meeting and intermingling is yellow. But in the past, we could not define the genomic region where this contact point is happening,” says Maass.
“The main idea behind this project was to combine new technologies to find where these inter-chromosomal contact points are.”
While a major advantage of imaging is being able to study living cells, using high-throughput chromosome conformation capture (Hi-C) data sets allowed the researchers to analyze unbiased genomics data with up to billions of sequencing reads, giving them a vast pool of information.
The team worked with 62 different data sets, and each of them included more than 3.8 million possible interactions in one genome. To run and optimize their algorithms, they required strong computational power, which was provided by The Hospital for Sick Children (SickKids), where Maass is a scientist in the Genetics and Genome Biology program.
“The nature of genetic data is that it’s very correlated and high dimensional, so we need machine learning to extract the patterns from the data, but that required a new and cross-disciplinary method,” says Milad Mokhtaridoost, who is a postdoctoral fellow in computational biology in Maass’s lab and the first author of the new study.
The team used two different machine learning methods for analyzing the Hi-C data, one supervised and one unsupervised.
“For supervised learning, we used a regression-based model,” says Mokhtaridoost. “For unsupervised learning, we modelled all chromosomal interactions as one large network, and we used a community detection technique to partition the genome into smaller clusters so we could hypothesize the relative positions of different chromosomes within the cell.”
“One distinction between these two approaches is that in supervised learning, we know what we are looking for. And in unsupervised learning, we don’t, so we let the data speak for itself,” adds Aref.
“Unsupervised network clustering is about finding patterns in interconnected data. So, in the study, our results were first obtained using the supervised learning method and then reaffirmed using unsupervised learning. We also further validated our results with wet lab experiments.
“We were working on a topic that is highly underexplored, so we wanted to make sure that our findings hold up regardless of how we looked at the problem.”
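The unsupervised step described above, in which chromosomal interactions are modelled as one network and then partitioned by community detection, can be illustrated with a small sketch. This is not the Signature implementation, and the article does not say which community detection technique the team used. The sketch assumes a simple greedy modularity merge, a standard textbook approach, and the chromosome names, contact weights and function names are all made up for illustration.

```python
from itertools import combinations

# Made-up contact weights between hypothetical chromosomes.
# These are illustrative values only, not data from the study.
toy_contacts = {
    ("chrA", "chrB"): 9.0,
    ("chrA", "chrC"): 8.0,
    ("chrB", "chrC"): 7.0,
    ("chrD", "chrE"): 9.0,
    ("chrD", "chrF"): 8.0,
    ("chrE", "chrF"): 7.0,
    ("chrC", "chrD"): 1.0,  # one weak link between the two groups
}


def modularity(edges, communities):
    """Weighted Newman modularity of a partition of an undirected network."""
    total = sum(edges.values())  # total edge weight m
    strength = {}  # weighted degree of each node
    for (u, v), w in edges.items():
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
    q = 0.0
    for com in communities:
        # Weight of edges with both endpoints inside this community.
        inside = sum(w for (u, v), w in edges.items() if u in com and v in com)
        degree = sum(strength.get(n, 0.0) for n in com)
        q += inside / total - (degree / (2 * total)) ** 2
    return q


def greedy_communities(edges):
    """Start with one community per node and repeatedly apply the single
    merge that raises modularity the most, until no merge improves it."""
    nodes = sorted({n for pair in edges for n in pair})
    communities = [frozenset([n]) for n in nodes]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        best_merge = None
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            q = modularity(edges, merged)
            if q > best_q + 1e-12:  # keep only strict improvements
                best_q, best_merge = q, merged
        if best_merge is None:
            break
        communities = best_merge
    return communities, best_q


communities, q = greedy_communities(toy_contacts)
for com in communities:
    print(sorted(com))
print(f"modularity = {q:.3f}")
```

On this toy input the sketch separates the network into two clusters, chrA, chrB and chrC in one and chrD, chrE and chrF in the other, because the single weak link between them is not enough to justify a merge. Real Hi-C networks are far larger and noisier, so a production analysis would likely rely on an established network library rather than this brute-force loop.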
This interdisciplinary approach was crucial to making their breakthrough possible, says Mokhtaridoost, who has an industrial engineering background. Obtaining their results required expertise in genetics, data science, machine learning and microscopy.
The idea for this unsupervised machine learning approach came about after Mokhtaridoost heard Aref discuss the computational challenge of developing a new clustering algorithm for scenarios where the data is highly interdependent and only the interactions between the different entities are known.
“Milad had told me about one of his projects with Philipp, where they were looking at interactions between chromosomes and wanted to cluster them,” says Aref.
“But network clustering has not been previously used in this context, to the best of my knowledge, despite how well suited it is for dealing with this type of big data.”
While their newly published study looked at normal, healthy genomic data sets, the researchers are now looking at data sets from different cancers.
“Now that we know how the genome is organized in healthy samples, we want to understand how a disease genome is getting re-organized, which chromosomal contacts are rearranged and how they are contributing to the disease,” says Maass.
“When we analyze disease genomes, we may be able to say how important a certain position in the genome is, creating pathways to new methods of disease prevention and treatments.”