Linear Integer Programming Approaches for Chloroplast Genome Scaffolding
Victor Epain  1@  , Rumen Andonov  2@  
1 : Inria Rennes – Bretagne Atlantique
Institut National de Recherche en Informatique et en Automatique, GenScale
2 : Institut de Recherche en Informatique et Systèmes Aléatoires
Universite de Rennes 1, GenScale

1 Introduction

Genome assembly aims to assemble one genome DNA strand from a large amount of small DNA sequences denoted as reads. Due to sequencing technologies, and the fact that the DNA molecule is composed of two complemented strands, the reads must be considered in two exclusive orientations. Arbitrary, the as-input state is denoted forward orientation, while the other state is denoted the reverse one. The reverse read's orientation is obtained by reversing and complementing (DNA alphabet bijection) its nucleotide sequence.

The genome assembly process can be roughly separated in two main steps. First, the oriented reads are merged into larger sequences named contigs (also oriented). They correspond to paths in a directed graph modelling successions between the oriented reads. An upper bound of the number of time a contig is repeated in the genome, denoted multiplicity, is also provided. Our study exclusively focuses on the second stage of the genome assembly, scaffolding, in which contigs are put in correct order and orientation towards the completion of the final assembly. Usually this step uses additional data e.g. mate pairs distances, or homology references from near-species. In contrast to that, we demonstrate here that chloroplasts genomes can be assembled without any raw additional data, but only using the available biological knowledge on some chloroplast genome structures [2]. As a case example, Figure (1) illustrates one of them, in which two genomic regions are repeated, but one's sequence is the reverse-complement of the other (inverted repeats).


2 Method

The first step outputs a directed graph. Each contig gives two (forward and reverse orientation) times its multiplicity vertices.
So each vertex corresponds to one orientation (forward or reverse) of a contig, enriched by an occurrence integer (from 0 to multiplicity - 1). Edges represent oriented multiplied contigs' successions. We model the genome assembly as an elementary circuit in this graph, see Figure (2). We formulate the inverted repeats with linear constraints, and we search for such a circuit using Integer Linear Programming similarly to [1].

Inverted repeats correspond to sequences of occurrences of contigs paired with other occurrences of them but in reverse orientation, see Figure (1). Therefore, paired contigs' positions on the assembled sequence must satisfy nested-pairs pattern (the blue dotted lines are parallel in Figure 1). The goal is to find the longest contiguous inverted repeats. We formulate the above constraints in terms of linear program where the objective is to maximise the nested-pairs number. It results in finding a couple of the longest contiguous inverted repeats. Thus, we generalise asimilar approach applied for RNA folding [4]. However, in contrast to the later approach where the vertices correspond to bases with known sequence indices, in our case the positions of the contigs are variables. Our tool is implemented with Python 3 and uses the open-source PuLP package which integrates a free solver CBC to solve the above optimisation problem.


3 Results

We verified our method with the well-known assembly evaluation tool QUAST [3] on a dataset that contains 42 chloroplast genome with inverted repeats, the same as in [5]. We ran the 42 instances on a laptop (16GB RAM, 8 cores), and we obtained very encouraging preliminary results, with high genome coverage (mostly > 99%), and very low mismatches and indels rates (metrics computed from sequences alignment between the known genome reference and the sequences produced by our tool). Assuming that all successions between contigs are provided, our approach permits finishing pre-assembled genomes in just a few seconds, for graphs that not exceed 154 vertices and 240 edges and a dozen of candidate nested pairs.


Personnes connectées : 23 Vie privée
Chargement...