Number of Mismatches and Length of Longest Match Correlate with Alignment Score in Swalign Built-in Function in MATLAB
Author(s): Wenfa Ng
Understanding how one sequence relates to another at the nucleotide or amino acid level yields new knowledge about the provenance of a particular sequence and allows the determination of consensus sequence motifs that inform biological conservation at the sequence level. To this end, local and multiple sequence alignment tools in bioinformatics have been developed to automatically compare two or more nucleotide or amino acid sequences in search of matching stretches that yield an alignment. While alignment score is a common metric for assessing alignment quality, the relative difference between two alignment scores does not readily translate into concrete measures such as the number of mismatches and the length of the longest match in an alignment. Thus, using the swalign local sequence alignment function in MATLAB on 200 alignments between RNA-seq sequence reads and the reference Escherichia coli K-12 MG1655 genome sequence in the sense and antisense directions, this work sought to shed light on how the alignment score from swalign correlates with the number of mismatches and the length of the longest match. Results revealed that the number of mismatches correlates negatively with alignment score, thereby validating the theoretical prediction that a larger number of mismatches results in a poorer alignment and a lower alignment score. However, the dependence of alignment score on other factors, such as the length of the longest match and the gap penalty incurred by opening an alignment gap, prevents a linear relationship from being obtained between the number of mismatches and alignment score. On the other hand, the length of the longest match was found to correlate positively with alignment score, as predicted from theoretical understanding. However, the data points clustered in two regions of the scatter plot: short matches with low alignment scores, and long matches with high alignment scores.
Such clustering, and the sparseness of data points between the two clusters, precludes the elucidation of a linear quantitative relationship between the length of the longest match and alignment score. Overall, the dependence of the swalign alignment score on the number of mismatches and the length of the longest match agrees with theoretical predictions, thereby validating the utility of alignment score as a qualitative indicator of alignment quality. However, given that alignment score inherently depends on a multitude of factors, users cannot easily discern the quantitative difference in mismatches and length of longest match from the relative difference between two alignment scores. Such problems are unlikely to be resolved, given the near impossibility of obtaining a quantitative linear relationship that correlates either the number of mismatches or the length of the longest match with the alignment score of a sequence alignment tool.
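
The point that alignment score depends on several factors at once can be illustrated with a small sketch. The Python function below is not MATLAB's swalign and does not perform an alignment. It only scores an alignment that is already gapped, and reports the number of mismatches and the length of the longest run of matches. The function name `alignment_metrics`, the match and mismatch values (+5 and -4), and the gap penalties (8 for opening and for extending) are assumptions chosen for illustration, not settings taken from this work.

```python
def alignment_metrics(top, bottom, match=5, mismatch=-4,
                      gap_open=8, gap_extend=8):
    """Score an already-gapped pairwise alignment and summarise it.

    All scoring parameters are illustrative assumptions.
    Returns (score, number_of_mismatches, longest_match_run).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")

    score = 0
    mismatches = 0
    longest = 0
    run = 0          # length of the current run of consecutive matches
    in_gap = False   # True while inside a run of gap columns

    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # Simplification: consecutive gap columns count as one gap,
            # regardless of which sequence carries the gap character.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
            run = 0
        else:
            in_gap = False
            if a == b:
                score += match
                run += 1
                longest = max(longest, run)
            else:
                score += mismatch
                mismatches += 1
                run = 0
    return score, mismatches, longest


# One mismatch in a 10-column alignment.
with_mismatch = alignment_metrics("ACGTACGTAC", "ACGTTCGTAC")
# A perfect but shorter 6-column alignment.
short_perfect = alignment_metrics("ACGTAC", "ACGTAC")
# No mismatches, but one gap column.
with_gap = alignment_metrics("ACGT-ACGTAC", "ACGTTACGTAC")

print(with_mismatch)   # (41, 1, 5)
print(short_perfect)   # (30, 0, 6)
print(with_gap)        # (42, 0, 6)
```

Under these assumed parameters, the alignment with one mismatch scores higher than the shorter perfect alignment, and two alignments with the same longest match (6) differ in score by 12. This is consistent with the observation that a difference between two scores cannot be mapped back to a unique difference in mismatches or in length of longest match.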