Clinical Applications of Long-Read Sequencing

Second- or next-generation sequencing (NGS) technologies have been around for two decades. Though relatively inexpensive and accurate, NGS produces short reads, which can miss certain regions of the genome and fail to identify the genetic cause of a disease. To overcome this challenge, researchers have turned to third-generation or long-read sequencing (LRS) technologies.

As implied by their names, the key difference between the two high-throughput sequencing technologies is the length of the reads they produce. NGS or short-read sequencing (SRS) typically produces reads that are less than 1,000 base pairs long, whereas LRS reads can sometimes reach hundreds of kilobases.

The need for LRS arises from the nature of DNA itself: because DNA is organized in long molecules, short fragmented reads must be reassembled into the correct sequence, which is a complex task. Short reads work well for identifying single nucleotide variants (SNVs) and small insertions or deletions. For large changes in DNA sequence, known as structural variations (SVs), short-read sequencing is often insufficient because the altered sequence is usually longer than 1,000 base pairs and so cannot be covered by a single short read. SVs include gene duplications, fusions, translocations, inversions, and copy number variations.
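The read-length argument above can be made concrete with a little arithmetic. The sketch below counts how many ways a single read can be placed so that it covers an entire variant plus some unchanged sequence on each side. The function name, the 50-base flank, and the three read lengths are illustrative assumptions, not values from any sequencing platform.

```python
def spanning_placements(read_length: int, sv_length: int, flank: int = 50) -> int:
    """Count read start offsets at which one read covers the whole variant
    plus `flank` bases of unchanged sequence on both sides.

    `flank` is an assumed anchoring requirement, chosen only for illustration.
    """
    needed = sv_length + 2 * flank  # bases the read must cover end to end
    return max(0, read_length - needed + 1)


# Illustrative read lengths: a typical short read, the 1,000 bp upper bound
# mentioned in the text, and a hypothetical 15 kb long read.
for read_length in (150, 1_000, 15_000):
    placements = spanning_placements(read_length, sv_length=1_000)
    print(f"{read_length:>6} bp read: {placements} spanning placements")
```

Under these assumptions, a read of 1,000 bases or fewer can never span a 1,000 bp variant, while a long read has thousands of placements that do.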

Recent research using LRS shows that human genomes can harbor over 20,000 SVs, many of which can go undetected using NGS. As a result, SVs have been studied far less than SNVs.

SVs are important because they can alter gene function and are involved in many genetic diseases. For example, in Parkinson’s disease, ATTCC repeat expansions have been found within intron 9 of the ATXN10 gene, which is involved in neuron survival and differentiation, and CAG trinucleotide repeat expansions in the huntingtin gene are responsible for Huntington’s disease.
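To illustrate what sizing a repeat expansion involves, here is a minimal sketch that counts the longest uninterrupted run of a motif within a single read. The function name and the example read are made up for illustration; real repeat-sizing tools also have to cope with sequencing errors and interrupted repeats, which this sketch ignores.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Synthetic read: 42 CAG copies between two short, arbitrary flanking sequences.
read = "GGTT" + "CAG" * 42 + "CCGCCA"
print(longest_repeat_run(read, "CAG"))
```

The count is only meaningful when the read contains the whole repeat together with its flanks. A hypothetical 150-base read holds at most 50 copies of a three-base motif, so it cannot size a longer expansion on its own.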

LRS can cover the entire sequence of large SVs, capturing the bigger picture better than short reads. Long reads also align more accurately to a reference genome and simplify de novo genome assembly, making it easier to study uncharacterized parts of the genome.

Clinical applications of long-read sequencing

In the clinic, LRS is well suited for diagnostics and personalized medicine. In particular, it can make an impact in the area of targeted sequencing, where clinicians look for genetic variations within specific areas of the genome, some of which are not accessible by NGS.

Recent studies have shown that LRS can be used to identify disease-causing mutations in human diseases with no previously known genetic cause. LRS can also be used to further characterize large changes in DNA sequence, such as tandem repeats in fragile X syndrome and spinocerebellar ataxia, and to identify SVs involved in Mendelian diseases.

Moreover, LRS can be used to map complex genomic areas. For example, researchers created a complete map of SVs that contribute to HER2+ breast cancer, leading to the identification of novel gene fusions involved in the disease.

In terms of personalized cancer medicine, LRS has been used to identify low-frequency resistance mutations in BCR-ABL1 that affect treatment efficacy in patients with chronic myeloid leukemia, including mutations that were not detected with the usual clinical methods.

Researchers are also using LRS for human leukocyte antigen (HLA) typing prior to organ and stem cell transplantation to accurately match donors and recipients based on their HLA profiles. Because a single nucleotide change can define a person’s HLA type, using short reads that need to be pieced together to form a complete sequence can lead to errors. In the past, this meant conducting several sequencing experiments on different platforms and then cross-referencing the results to ensure accuracy. With LRS, however, researchers can accurately cover the entire HLA class I gene region at once in a matter of hours. In one study, researchers discovered eight novel HLA-E alleles and added them to the IPD-IMGT/HLA Database, which has more than 25,000 allele sequences for human HLA genes. Better matches could mean better outcomes for patients undergoing transplantation.

Reference genomes, pseudogenes, and COVID-19

In terms of clinical research, LRS can also be used to fill in gaps in the human reference genome and to create reference genomes for populations of non-European descent. The first Chinese and Korean human reference genomes created using LRS revealed previously unreported population-specific differences.

These findings are significant for populations that are underrepresented in reference sequence databases, especially when it comes to clinical applications like HLA typing. Through initiatives such as the HLA diversity in Africa project, researchers have discovered new sequences in HLA class I and class II genes*, an essential resource for HLA typing in people of African ancestry.

LRS can be used to differentiate genes from pseudogenes, such as in autosomal-dominant polycystic kidney disease, where diagnosis is challenging because the PKD1 gene is homologous to six pseudogenes (nonfunctional DNA that resembles functional genes).

It can also be used to sequence the entire genome of smaller organisms, including viruses, in a single read. Most recently, LRS has been used to sequence SARS-CoV-2, the virus that causes COVID-19, and to create a highly specific test for it.

Advantages and disadvantages of LRS

One major advantage of LRS is that it reduces amplification bias. In targeted NGS, genetic material is usually first amplified by PCR, which can introduce errors, whereas LRS does not require PCR amplification. In addition, when the DNA is not amplified and remains in its native state, LRS can detect relevant epigenetic changes, such as DNA methylation patterns.

However, there are a few drawbacks to LRS. For one, the data produced by LRS differ from the data produced by NGS and therefore require different analysis tools. More than 469 such tools have been catalogued to date on long-read-tools.org.

Over 20 percent of these analysis tools are dedicated to error correction, but researchers have improved the accuracy of LRS technologies in recent years. Earlier reports put LRS error rates at about 10 percent, versus 0.1 to 1.0 percent for NGS. Recent claims report an error rate of less than one percent for Pacific Biosciences’ single-molecule real-time sequencing and less than five percent for Oxford Nanopore Technologies’ MinION sequencer.
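To put these percentages in perspective, the sketch below converts each quoted error rate into an expected number of miscalled bases per read and into a Phred quality score, Q = -10 log10(p), the scale commonly used to report base quality. The 10,000-base read length and the row labels are assumptions chosen for illustration, and the rates are simply the figures quoted above, not measurements.

```python
import math


def phred_quality(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(error_rate)


def expected_errors(error_rate: float, read_length: int) -> float:
    """Expected number of miscalled bases in one read at a uniform per-base error rate."""
    return error_rate * read_length


READ_LENGTH = 10_000  # assumed read length in bases, for illustration only

# Error rates quoted in the text; the labels are shorthand, not official names.
quoted_rates = [
    ("earlier LRS reports", 0.10),
    ("MinION claim, upper bound", 0.05),
    ("SMRT claim, upper bound", 0.01),
    ("NGS, low end of range", 0.001),
]

for label, rate in quoted_rates:
    errors = expected_errors(rate, READ_LENGTH)
    print(f"{label:<28} Q{phred_quality(rate):4.1f}  ~{errors:,.0f} errors per {READ_LENGTH:,} bases")
```

On this scale, each tenfold drop in error rate adds 10 to the quality score, which is why a move from 10 percent to under one percent is a substantial change for reads that are thousands of bases long.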

To get around higher error rates, some researchers have combined the two types of sequencing technologies: short reads from NGS ensure accuracy at the individual nucleotide level, while long reads generate a complete picture of a genome.
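The idea behind this hybrid approach can be sketched as a per-position majority vote. This is a deliberately simplified toy, not the algorithm of any particular correction tool: it assumes the short reads are already aligned to the long read at known offsets and that errors are substitutions only. The function name, the `min_depth` threshold, and the example sequences are all hypothetical.

```python
from collections import Counter


def polish(long_read: str, short_alignments: list[tuple[int, str]], min_depth: int = 3) -> str:
    """Replace long-read bases with the short-read majority base where coverage allows.

    `short_alignments` holds (offset, sequence) pairs, assumed to be pre-aligned
    to `long_read` without insertions or deletions.
    """
    pileup = [Counter() for _ in long_read]
    for offset, seq in short_alignments:
        for i, base in enumerate(seq):
            if 0 <= offset + i < len(long_read):
                pileup[offset + i][base] += 1

    polished = []
    for original, counts in zip(long_read, pileup):
        depth = sum(counts.values())
        if depth >= min_depth:
            base, votes = counts.most_common(1)[0]
            # Only overwrite when a strict majority of short reads agree.
            polished.append(base if votes * 2 > depth else original)
        else:
            polished.append(original)  # too little short-read evidence; keep as is
    return "".join(polished)


truth = "ACGTACGTAC"
long_read = "ACGTTCGTAC"  # one substitution error at index 4 (A read as T)
short_reads = [(2, "GTACG"), (3, "TACGT"), (4, "ACGTA")]  # error-free, overlapping
print(polish(long_read, short_reads))
```

Here the long read supplies the overall layout and the overlapping short reads outvote its single miscalled base. Positions covered by fewer than `min_depth` short reads are left untouched.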

LRS can improve clinical care

Compared to short-read sequencing, LRS is currently limited by lower throughput, a somewhat higher error rate, and higher cost. But as researchers work to improve LRS, increasing throughput and lowering the cost, wider use of LRS in a clinical context may rapidly improve our understanding of cancer, pathogen evolution, drug resistance, and genetic diversity in complex regions of the genome that have important implications for clinical care.

References:

*Pollard, M., Tommy, C., Cristina, P., Gurdasani, D. and Investigators, G. (2017). The MHC diversity in Africa resource: a roadmap to understanding HLA diversity in Africa. Presented at: The 67th Annual Meeting of The American Society of Human Genetics, Orlando, FL, USA.