*Aleksandar Milosavljević*

- Published in print:
- 1999
- Published Online:
- November 2020
- ISBN:
- 9780195119404
- eISBN:
- 9780197561256
- Item type:
- chapter

- Publisher:
- Oxford University Press
- DOI:
- 10.1093/oso/9780195119404.003.0006
- Subject:
- Computer Science, Systems Analysis and Design

The parsimony method for reconstruction of evolutionary trees (Sober, 1988) and the minimal edit distance method for DNA sequence alignments (e.g., Waterman, 1984) are both based on the principle ...
More

The parsimony method for reconstruction of evolutionary trees (Sober, 1988) and the minimal edit distance method for DNA sequence alignments (e.g., Waterman, 1984) are both based on the principle of Occam’s Razor (e.g., Losee, 1980; also known as the Parsimony principle). This principle favors the most concise theories among the multitudes that can possibly explain observed data. The conciseness may be measured by the number of postulated mutations within an evolutionary tree, by the number of edit operations that transform one DNA sequence into the other, or by another implicit or explicit criterion. A very general mathematical formulation of Occam’s Razor has been proposed via minimal length encoding by computer programs (for recent reviews, see Cover and Thomas, 1991; Li and Vitányi, 1993). Algorithmic significance is a general method for pattern discovery based on Occam’s Razor. The method measures parsimony in terms of encoding length, in bits, of the observed data. Patterns are defined as datasets that can be concisely encoded. The method is not limited to any particular class of patterns; the class of patterns is determined by specifying an encoding scheme. To illustrate the method, consider the following unusual discovery experiment: . . . 1. Pick a simple pseudorandom generator for digits from the set {0, 1, 2, 3}. 2. Pick a seed value for the generator and run it to obtain a sequence of 1000 digits; convert the digits to a DNA sequence by replacing all occurrences of digit 0 by letter A, 1 by G, 2 by C, and 3 by T. 3. Submit the sequence to a similarity search against a database containing a completely sequenced genome of a particular organism. . . . Assume that after an unspecified number of iterations of the three steps, with each iteration involving a different random generator or seed value or both, the search in the third step finally results in a genomic sequence highly similar to the query sequence. Does the genomic sequence contain a pattern? To argue for the presence of a pattern, one may directly apply the algorithmic significance method.
Less

The parsimony method for reconstruction of evolutionary trees (Sober, 1988) and the minimal edit distance method for DNA sequence alignments (e.g., Waterman, 1984) are both based on the principle of Occam’s Razor (e.g., Losee, 1980; also known as the Parsimony principle). This principle favors the most concise theories among the multitudes that can possibly explain observed data. The conciseness may be measured by the number of postulated mutations within an evolutionary tree, by the number of edit operations that transform one DNA sequence into the other, or by another implicit or explicit criterion. A very general mathematical formulation of Occam’s Razor has been proposed via minimal length encoding by computer programs (for recent reviews, see Cover and Thomas, 1991; Li and Vitányi, 1993). Algorithmic significance is a general method for pattern discovery based on Occam’s Razor. The method measures parsimony in terms of encoding length, in bits, of the observed data. Patterns are defined as datasets that can be concisely encoded. The method is not limited to any particular class of patterns; the class of patterns is determined by specifying an encoding scheme. To illustrate the method, consider the following unusual discovery experiment: . . . 1. Pick a simple pseudorandom generator for digits from the set {0, 1, 2, 3}. 2. Pick a seed value for the generator and run it to obtain a sequence of 1000 digits; convert the digits to a DNA sequence by replacing all occurrences of digit 0 by letter A, 1 by G, 2 by C, and 3 by T. 3. Submit the sequence to a similarity search against a database containing a completely sequenced genome of a particular organism. . . . Assume that after an unspecified number of iterations of the three steps, with each iteration involving a different random generator or seed value or both, the search in the third step finally results in a genomic sequence highly similar to the query sequence. Does the genomic sequence contain a pattern? To argue for the presence of a pattern, one may directly apply the algorithmic significance method.