Autocorrelation (words)

Summary

In combinatorics, a branch of mathematics, the autocorrelation of a word is the set of periods of this word. More precisely, it is a sequence of values which indicate how much the end of a word looks likes the beginning of a word. This value can be used to compute, for example, the average value of the first occurrence of this word in a random string.

Definition edit

In this article, A is an alphabet, and   a word on A of length n. The autocorrelation of   can be defined as the correlation of   with itself. However, we redefine this notion below.

Autocorrelation vector edit

The autocorrelation vector of   is  , with   being 1 if the prefix of length   equals the suffix of length  , and with   being 0 otherwise. That is   indicates whether  .

For example, the autocorrelation vector of   is   since, clearly, for   being 0, 1 or 2, the prefix of length   is equal to the suffix of length  . The autocorrelation vector of   is   since no strict prefix is equal to a strict suffix. Finally, the autocorrelation vector of   is 100011, as shown in the following table:

a a b b a a
a a b b a a 1
a a b b a a 0
a a b b a a 0
a a b b a a 0
a a b b a a 1
a a b b a a 1

Note that   is always equal to 1, since the prefix and the suffix of length   are both equal to the word  . Similarly,   is 1 if and only if the first and the last letters are the same.


Autocorrelation polynomial edit

The autocorrelation polynomial of   is defined as  . It is a polynomial of degree at most  .

For example, the autocorrelation polynomial of   is   and the autocorrelation polynomial of   is  . Finally, the autocorrelation polynomial of   is  .

Property edit

We now indicate some properties which can be computed using the autocorrelation polynomial.

First occurrence of a word in a random string edit

Suppose that you choose an infinite sequence   of letters of  , randomly, each letter with probability  , where   is the number of letters of  . Let us call   the expectation of the first occurrence of ?  in  ? . Then   equals  . That is, each subword   of   which is both a prefix and a suffix causes the average value of the first occurrence of   to occur   letters later. Here   is the length of v.

For example, over the binary alphabet  , the first occurrence of   is at position   while the average first occurrence of   is at position  . Intuitively, the fact that the first occurrence of   is later than the first occurrence of   can be explained in two ways:

  • We can consider, for each position  , what are the requirement for  's first occurrence to be at  .
    • The first occurrence of   can be at position 1 in only one way in both case. If   starts with  . This has probability   for both considered values of  .
    • The first occurrence of   is at position 2 if the prefix of   of length 3 is   or is  . However, the first occurrence of   is at position 2 if and only if the prefix of   of length 3 is  . (Note that the first occurrence of   in   is at position 1.).
    • In general, the number of prefixes of length   such that the first occurrence of   is at position   is smaller for   than for  . This explain why, on average, the first   arrive later than the first  .
  • We can also consider the fact that the average number of occurrences of   in a random string of length   is  . This number is independent of the autocorrelation polynomial. An occurrence of   may overlap another occurrence in different ways. More precisely, each 1 in its autocorrelation vector correspond to a way for occurrence to overlap. Since many occurrences of   can be packed together, using overlapping, but the average number of occurrences does not change, it follows that the distance between two non-overlapping occurrences is greater when the autocorrelation vector contains many 1's.

Ordinary generating functions edit

Autocorrelation polynomials allows to give simple equations for the ordinary generating functions (OGF) of many natural questions.

  • The OGF of the languages of words not containing   is  .
  • The OGF of the languages of words containing   is  .
  • The OGF of the languages of words containing a single occurrence of  , at the end of the word is  .

References edit

  • Flajolet and Sedgewick (2010). Analytic Combinatorics. New York: Cambridge University Press. pp. 60-61. ISBN 978-0-521-89806-5.
  • Rosen, Ned. "Expected waiting times for strings of coin flips" (PDF). Retrieved 3 December 2017.
  • Odlyzko, A. M.; Guibas, L. J. (1981). "String overlaps, pattern matching, and nontransitive games". Journal of Combinatorial Theory. Series A 30 (2): 183–208. doi:10.1016/0097-3165(81)90005-4.