Multiobjective Sequence Alignment Formulation, Algorithms and Application

Abbasi, Maryam

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/87604

Title:	Multiobjective Sequence Alignment Formulation, Algorithms and Application
Authors:	Abbasi, Maryam
Orientador:	Paquete, Luís
Keywords:	Sequence Alignment; Multiobjective Optimization; Dynamic Programming; Phylogenetic Trees; Combinatorial Optimization; Alinhamento de sequências; Otimização multiobjetivo; Programação Dinâmica; Árvores filogenéticas; Otimização Combinatória
Issue Date:	18-Mar-2019
Project:	eu-repo/grantAgreement/FCT/SFRH/SFRH%2FBD%2F91451%2F2012/PT eu-repo/grantAgreement/FCT/5876-PPCDTI/98674/PT
Abstract:	Sequence alignment is a standard technique in bioinformatics to measure the relationship between evolutionary or structurally related DNA/proteins sequences. Most modern programs for sequence alignment optimize a given objective function that is a convex combination of how many gaps need to be inserted into the sequences and how many characters become aligned. Clearly, depending on the weights given to each of the two components, different optimal alignments can be obtained. Therefore, choosing only one weight setting may provide an undesirable bias in further steps of the analysis, such as phylogenetic tree construction, and provide too simplistic interpretations. In this thesis, we take a different point of view on the mathematical formulation of the sequence alignment problem. Rather than considering the optimization of a scalar score function, resulting from a weighted sum of components, we consider a vector score function with the goal of optimizing, simultaneously, the different score components. This brings us to the topic of multiobjective optimization, which deals with the mathematical formulation of optimization problems with several conflicting objectives as well as with algorithms to solve them. Under this new formulation, these algorithms return a set of non-dominated alignments, each of which representing a trade-off between the several components. This set gives further information about the similarity of the sequences, from which a practitioner could analyze and choose the most plausible alignment. We consider the biobjective pairwise sequence alignment problem and propose extensions of efficient dynamic programming algorithms for several variants of this problem. We propose a novel pruning technique that substantially reduces the computation time and memory usage. Moreover, we consider a biobjective variant of this problem with more than two sequences, which is computationally intractable. We introduce local search techniques for this problem and conduct an in-depth experimental analysis on a wide range of benchmark instances. Based on the hypervolume indicator and empirical attainment function methodology, we establish functional relationships between algorithm performance and instance features. Finally, we present a method that uses multiobjective concepts for the construction of phylogenetic trees. We test this method on two real-life cases and show that the number of distinct phylogenetic tree topologies obtained is very small. This work shows that multiobjective concepts can successfully be applied to the sequence alignment problem and identifies which approaches can be used for the several variants of this problem. We believe that the methods proposed in this thesis, by providing more information about the relationship between biological sequences than the current known procedures, can be of great value to a broad range of research communities as well as to practitioners in the field. O alinhamento de sequências é um procedimento utilizado na Bioinformática que tem por objetivo medir a semelhança entre sequências de DNA ou proteínas relacionadas entre si de uma forma evolutiva ou estrutural. As aplicações atuais para alinhamento de sequências otimizam uma determinada função objetivo que resulta da combinação convexa da quantidade de espaços a inserir nas sequências e da quantidade de caracteres que ficam alinhados. Dependendo das ponderações atribuídas a cada um destes dois componentes, diferentes alinhamentos podem ser obtidos. Desta forma, a escolha de uma só ponderação pode enviesar, indesejadamente, os passos seguintes da análise, por exemplo, na construção de árvores filogenéticas, e fornecer interpretações demasiado simples. Esta tese aborda o problema de alinhamento de sequências de uma forma diferente na perspetiva de formulação matemática. Em vez da otimização de uma função escalar que resulta de uma soma ponderada das componentes, considera-se uma função vetorial em que se pretende otimizar, simultaneamente, as suas componentes. O estudo destes problemas é abordado em otimização multi-objetivo, que lida com as formulações matemáticas de problemas de otimização com vários objetivos conflituosos entre si e com algoritmos para a sua resolução. Com esta nova formulação, os algoritmos retornam um conjunto de alinhamentos não-dominados, cada um representando um compromisso entre as várias componentes da função objetivo. Este conjunto, ao fornecer mais informação acerca da semelhança entre as sequências em análise, permite, ao profissional, escolher o alinhamento mais plausível. Neste estudo, considera-se o problema bi-objetivo de alinhamento emparelhado de sequências e variantes deste problema, para os quais propõem-se extensões de algoritmos eficientes baseados em programação dinâmica. Propõe-se iguamente uma variante bi-objetivo deste problema para mais do que duas sequências, que é considerado um problema computacionalmente intratável. Por esta razão, apresentam-se técnicas de procura local para este problema. Estes algoritmos são analisados experimentalmente num conjunto de instâncias de referência. Com base no indicador de hipervolume e na metodologia das funções de aproveitamento, estabelecem-se relações funcionais entre o desempenho dos algoritmos e características destas instâncias. Finalmente, apresenta-se um método que utiliza conceitos de otimização multi-objetivo para a construção de árvores filogenéticas. Este método é testado em dois casos reais. Os resultados obtidos indicam que o número de topologias distintas de árvores filogenéticas é bastante pequeno. Este estudo mostra que os conceitos multi-objetivo podem ser utilizados com sucesso no problema de alinhamento de sequências e permite identificar quais as abordagens que podem ser utilizadas para cada uma das variantes apresentadas. Ao fornecer mais informação acerca da relação entre as sequências biológicas do que os métodos atuais, espera-se que as contribuições desta tese possam de grande valor tanto para a comunidade académica como para os profissionais de Bioinformática.
Description:	Tese de Doutoramento em Ciências e Tecnologias da Informação, apresentada ao Departamento de Engenharia Informática da Faculdade de Ciências e Tecnologia da Universidade de Coimbra
URI:	https://hdl.handle.net/10316/87604
Rights:	embargoedAccess
Appears in Collections:	FCTUC Eng.Informática - Teses de Doutoramento

Files in This Item:

File	Description	Size	Format
Multiobjective Sequence Alignment Formulation.pdf		18.88 MB	Adobe PDF	View/Open

Show full item record

Google Scholar^TM

Check

Files in This Item:

Google ScholarTM

Google Scholar^TM