

The Shortest Superstring Problem (SSP) is a combinatorial optimization problem that has attracted the interest of many researchers due to its applications.

[...] assembly algorithm. For a simple statistical model of the DNA sequence and the read process, we show that the answer admits a critical phenomenon in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the DNA [...] the Rényi entropy rate of the DNA sequence.

In the Shortest Superstring problem, we are given a set of strings and ask for a common superstring with the minimum number of characters. The Shortest Superstring problem is NP-hard, and several constant-factor approximation algorithms are known for it. Of particular interest is the GREEDY algorithm, which repeatedly merges two strings of maximum overlap until a single string remains. The GREEDY algorithm, being simpler than other well-performing approximation algorithms for this problem, has attracted attention since the 1980s and is commonly used in practical applications. Tarhio and Ukkonen (TCS 1988) conjectured that GREEDY gives a 2-approximation. In a seminal work, Blum, Jiang, Li, Tromp, and Yannakakis (STOC 1991) proved that the superstring computed by GREEDY is a 4-approximation, and this upper bound was improved to 3.5 by Kaplan and Shafrir (IPL 2005). We show that the approximation guarantee of GREEDY is at most $(13+\sqrt{\cdots}$ [...].

The greedy conjecture, posed in 1988, asked whether a simple greedy algorithm achieves a ratio of 2 for SSP. Here, we present a novel approach to bound the superstring approximation ratio with the compression ratio, which, when applied to the greedy algorithm, shows a [...] approximation ratio for [...]-SSP, and also that greedy achieves ratios smaller than [...]. This leads to a new version of the greedy conjecture.
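The GREEDY procedure described above (repeatedly merge two strings of maximum overlap until a single string remains) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation analysed in any of the cited works: the function names `overlap` and `greedy_superstring` are hypothetical, ties between equally good pairs are broken by scan order (an assumption, since the text does not fix a tie-breaking rule), and the quadratic pair scan is chosen for readability rather than speed.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(words):
    """Illustrative GREEDY: merge the pair of maximum overlap until one string is left."""
    # Assumption: duplicates and words contained in another word are dropped first,
    # since any superstring of the remaining words already contains them.
    strings = list(dict.fromkeys(words))
    strings = [s for s in strings if not any(s != t and s in t for t in strings)]

    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Scan all ordered pairs; the first pair with the largest overlap wins.
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the chosen pair, writing the shared part only once.
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (best_i, best_j)]
        strings.append(merged)

    return strings[0] if strings else ""
```

For example, `greedy_superstring(["abc", "bcd", "cde"])` first merges `abc` and `bcd` (overlap 2) and then merges the result with `cde`, giving `abcde`.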
A superstring of a set of words is a string that contains each input word as a substring. Given such a set, the Shortest Superstring Problem (SSP) asks for a superstring of minimum length. SSP is an important theoretical problem related to the Asymmetric Travelling Salesman Problem, and also has practical applications in data compression and in bioinformatics. Indeed, it models the question of assembling a genome from a set of sequencing reads. Unfortunately, SSP is known to be NP-hard even on a binary alphabet, and it is also hard to approximate with respect to the superstring length or to the compression achieved by the superstring. Even the variant in which all words share the same length, called [...]-SSP, is NP-hard whenever [...]. Numerous involved approximation algorithms achieve an approximation ratio above [...] for the superstring, but remain difficult to implement in practice.
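The superstring definition and the two quality measures mentioned above (superstring length and compression) can be made concrete with a small Python sketch. The names `is_superstring`, `compression`, and `shortest_superstring_bruteforce` are hypothetical. Compression is taken here to mean the total length of the input words minus the length of the superstring, which is the usual convention but is an assumption as far as this text goes. The exhaustive search tries every ordering of the words, so it is only meant for a handful of short words and says nothing about how the approximation algorithms discussed above work.

```python
from itertools import permutations


def is_superstring(candidate: str, words) -> bool:
    """True if every input word occurs in `candidate` as a substring."""
    return all(w in candidate for w in words)


def compression(candidate: str, words) -> int:
    """Assumed definition: characters saved relative to plain concatenation."""
    return sum(len(w) for w in words) - len(candidate)


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring_bruteforce(words) -> str:
    """Exhaustive search over word orderings; exponential time, tiny inputs only."""
    # Words contained in another word never change the answer, so drop them.
    words = list(dict.fromkeys(words))
    words = [w for w in words if not any(w != t and w in t for t in words)]

    best = None
    for order in permutations(words):
        s = ""
        for w in order:
            # Append `w`, reusing its longest overlap with what is already built.
            s += w[overlap(s, w):]
        if best is None or len(s) < len(best):
            best = s
    return best if best is not None else ""
```

With the words `abc`, `bcd`, and `cde`, the shortest superstring has length 5, so its compression under the assumed definition is 9 - 5 = 4.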
