Number of Mismatches and Length of Longest Match Correlate with Alignment Score in Swalign Built-in Function in MATLAB
Author(s): Wenfa Ng
Understanding how one sequence relates to another at the nucleotide or amino acid level allows the derivation of new knowledge regarding the provenance of particular sequence as well as the determination of consensus sequence motifs that informs biological conservation at the sequence level. To this end, local or multiple sequence alignments tools in bioinformatics have been developed to automatically profile two or more nucleotide or amino acid sequence in search of matches in stretches of nucleotides or amino acid sequence that yield an alignment. While alignment score is a common metric for assessing alignment quality, relative difference between alignment scores does not readily correlate with concrete measures such as number of mismatches and length of longest match in alignment. Thus, using swalign local sequence alignment function in MATLAB on 200 alignments between RNA-seq sequence read and reference Escherichia coli K-12 MG1655 genome sequence in the sense and antisense direction, this work sought to shed some light on how alignment score from swalign correlates with number of mismatches and length of longest match. Results revealed that number of mismatches negatively correlate with alignment score; thereby, validating theoretical predictions that larger number of mismatches would result in a poorer alignment and lower alignment score. However, dependence of alignment score on other factors such as length of longest match and gap penalty from opening an alignment gap prevents linear relationship to be obtained between number of mismatches and alignment score. On the other hand, length of longest match was found to positively correlate with alignment score as predicted from theoretical understanding. But, data obtained revealed that clusters of data points gather at two regions of the scatter plot involving short matches and low alignment score, as well as long matches and high alignment score. Such clustering and sparseness of