Interval graphs with repeats and the DNA fragment assembly problem
The DNA fragment assembly problem is described as follows. There is a DNA sequence $Seq[1..L]$, over the alphabet $\{A,C,G,T\}$, which is unknown. Fragments are taken from $Seq$ and their sequence is directly determined by sequencing machines. Each fragment corresponds to an interval $[i,j]$ contained in $[1,L]$ and the sequencing machine outputs the substring $Seq[i,j]=Seq(i)Seq(i+1)\ldots Seq(j)$. Typically the size of interval $[i,j]$, which is $j-i+1$, lies in the range $500$ to $1000$, while $L$ is much larger; in some cases like, $L=50,000$, but it may reach $3,000,000,000$. Given a number of fragments we want to determine the string $Seq$. Interval Graphs have been used to model this problem, but they do not account for repeats in the sequence $Seq$. In this note we introduce a new formalism, Interval Graphs with Repeats, to address this issue.
2002