RTD

Validation:

To validate the performance of the RTD-based alignment-free method, five data sets (listed in Table 1) were compiled, taking the following factors into account:

  • Variation in length of sequences
  • Representation of different domains of life, viz. viruses, prokaryotes and eukaryotes (note: whether viruses should be classified as living or non-living is a matter of debate)
  • Variation in the extent of sequence similarity across taxonomic ranks (genotype, species, family, class). The data sets therefore range from highly similar sequences belonging to genotypes of the same species to dissimilar sequences representing distant members of a class (see Table 1 for a summary).
  • Objectives and types of phylogenetic studies, apart from the study of biomolecular evolution, such as multilocus sequence typing (MLST), which uses multiple genes; genotyping of isolates of the same species using a single gene or genomic region; and identification of new species using complete genomes.

Table 1: Summary of data sets used in the study

Sr. no.  Data set              Taxon                  OTUs (a)  Unaligned length (b)  Aligned length (c)  Identity, % (d)
1        MLST genes            Genus Aeromonas        115       4676-4957             4987                82.17 - 100
2        SH gene               Mumps virus genotypes  32        316-318               318                 72.64 - 100
3        Mitochondrial genome  Class Mammalia         31        16338-17447           20317               61.42 - 97.78
4        Complete genome       Genus Enterovirus      113       6944-7458             8006                51.91 - 98.98
5        Complete genome       Family Flaviviridae    59        9406-12813            17163               33.40 - 90.81

a: number of operational taxonomic units (OTUs); b: length variation of unaligned sequences; c: length of aligned sequences; d: variation of percent identity in aligned sequences. Data sets are listed in decreasing order of % sequence identity.
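
The summary columns of Table 1 can be derived from a multiple sequence alignment. The following is a minimal sketch, not the procedure used in the study: the function names `percent_identity` and `summarize` are hypothetical, and the treatment of gaps (columns where both sequences have a gap are skipped; a gap opposite a residue counts as a mismatch) is an assumption, since the source does not state how percent identity was computed.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity between two rows of the same alignment.

    Assumption (not from the source): columns in which both sequences
    have a gap are ignored; a gap opposite a residue is a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # shared gap column carries no information
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def summarize(alignment):
    """Return the Table 1 style summary for a {name: aligned_sequence} dict."""
    seqs = list(alignment.values())
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    # Unaligned length = aligned row with gap characters removed.
    unaligned = [len(s.replace("-", "")) for s in seqs]
    identities = [percent_identity(a, b) for a, b in combinations(seqs, 2)]
    return {
        "otus": len(seqs),
        "unaligned_length": (min(unaligned), max(unaligned)),
        "aligned_length": len(seqs[0]),
        "identity_percent": (round(min(identities), 2),
                             round(max(identities), 2)),
    }


# Toy alignment (made-up sequences, for illustration only).
toy = {
    "seq1": "ACGT-ACGT",
    "seq2": "ACGTTACGT",
    "seq3": "ACGA-ACGA",
}
summary = summarize(toy)
# otus = 3, unaligned_length = (8, 9), aligned_length = 9,
# identity_percent = (66.67, 88.89)
print(summary)
```

A different gap convention (for example, excluding every column that contains a gap) would give different identity ranges, so the values produced by this sketch need not match those reported in Table 1.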

Results: Phylogenetic trees