Pearson hashing

Pearson hashing[1][2] is a hash function designed for fast execution on processors with 8-bit registers. Given an input consisting of any number of bytes, it produces as output a single byte that is strongly dependent[1] on every byte of the input. Its implementation requires only a few instructions, plus a 256-byte lookup table containing a permutation of the values 0 through 255.

This hash function is a CBC-MAC that uses an 8-bit substitution cipher implemented via the substitution table. An 8-bit cipher has negligible cryptographic security, so the Pearson hash function is not cryptographically strong, but it is useful for implementing hash tables or as a data integrity check code, for which purposes it offers these benefits:

  • It is extremely simple.
  • It executes quickly on resource-limited processors.
  • There is no simple class of inputs for which collisions (identical outputs) are especially likely.
  • Given a small, privileged set of inputs (e.g., reserved words for a compiler), the permutation table can be adjusted so that those inputs yield distinct hash values, producing what is called a perfect hash function.
  • Two input strings differing by exactly one character never collide.[3] E.g., applying the algorithm on the strings ABC and AEC will never produce the same value.
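
The perfect-hash property in the list above can be illustrated with a brute-force search. This is only a minimal sketch: it reshuffles the table until a small keyword set hashes without collisions, which is not necessarily how Pearson adjusted the table. The names `pearson`, `find_perfect_table`, and `KEYWORDS`, the seed, and the retry limit are all hypothetical choices made for the example.

```python
import random


def pearson(data: bytes, table) -> int:
    """Basic 8-bit Pearson hash with h initialised to 0."""
    h = 0
    for byte in data:
        h = table[h ^ byte]
    return h


def find_perfect_table(keys, seed=0, max_tries=10_000):
    """Reshuffle a permutation of 0..255 until all keys hash to distinct values.

    Brute-force illustration only; it can succeed only for at most 256 keys.
    """
    rng = random.Random(seed)  # seeded so the search is repeatable
    table = list(range(256))
    for _ in range(max_tries):
        rng.shuffle(table)
        if len({pearson(k, table) for k in keys}) == len(keys):
            return list(table)
    raise RuntimeError("no collision-free table found")


# Hypothetical reserved-word set for a small compiler.
KEYWORDS = [b"if", b"else", b"while", b"for",
            b"return", b"break", b"continue", b"switch"]

table = find_perfect_table(KEYWORDS)
for word in KEYWORDS:
    print(word.decode(), pearson(word, table))
```

With only a handful of keys and 256 possible hash values, most random permutations are already collision-free for the set, so the loop usually ends after very few shuffles.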

One drawback compared with other hashing algorithms designed for 8-bit processors is the suggested 256-byte lookup table, which can be prohibitively large for a small microcontroller whose program memory is on the order of hundreds of bytes. A workaround is to compute the permutation with a simple function instead of storing a table in program memory. However, a function that is too simple, such as T[i] = 255-i, partly defeats its usefulness as a hash function: the hash then reduces to the XOR of all input bytes (complemented when the input length is odd), so anagrams yield the same hash value. A function that is too complex, on the other hand, slows the hash down. Using a function rather than a table also allows extending the block size. Like the table, any such function must be bijective.
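
A minimal sketch of the table-free variant follows. The helper `pearson_fn` and the two permutation functions are hypothetical names. `complement` is the T[i] = 255-i example from the text, while `affine` is an assumed alternative chosen only because an odd multiplier makes it a bijection modulo 256; it is not claimed to mix well.

```python
def pearson_fn(data: bytes, perm) -> int:
    """Pearson hash that calls perm(i) instead of indexing a stored table."""
    h = 0
    for byte in data:
        h = perm(h ^ byte)
    return h


def complement(i: int) -> int:
    """The overly simple permutation T[i] = 255 - i."""
    return 255 - i


def affine(i: int) -> int:
    """Hypothetical alternative: an odd multiplier is invertible mod 256."""
    return (i * 167 + 13) & 0xFF


# Both functions must be bijections on 0..255, like a table permutation.
assert sorted(complement(i) for i in range(256)) == list(range(256))
assert sorted(affine(i) for i in range(256)) == list(range(256))

# Anagrams always collide under the complement function.
print(pearson_fn(b"listen", complement) == pearson_fn(b"silent", complement))  # True

# Values under the affine function, printed for comparison.
print(pearson_fn(b"listen", affine), pearson_fn(b"silent", affine))
```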

The algorithm can be described by the following pseudocode, which computes the hash of message C using the permutation table T:

    algorithm pearson hashing is
        h := 0

        for each c in C loop
            h := T[ h xor c ]
        end loop

        return h

The hash variable (h) may be initialized differently, e.g. to the length of the data (C) modulo 256; this particular choice is used in the Python implementation example below.

Python implementation to generate a (pseudo) 8-bit output

The 'table' parameter requires a pseudo-randomly shuffled list of the range [0..255]. This can easily be generated by using Python's built-in range function and permuting the result with random.shuffle:

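    from random import shuffle

    # Build a permutation of 0..255 to serve as the lookup table.
    example_table = list(range(0, 256))
    shuffle(example_table)

    def hash8(message: str, table) -> int:
        """Pearson hashing."""
        # Hash the UTF-8 bytes so that every table index stays within 0..255,
        # even for non-ASCII characters.
        data = message.encode("utf-8")
        h = len(data) % 256
        for byte in data:
            h = table[h ^ byte]
        return h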

C implementation to generate 64-bit (16 hex chars) hash

    #include <stddef.h>  /* size_t */
    #include <stdio.h>   /* snprintf */

    void Pearson16(const unsigned char *x, size_t len,
                   char *hex, size_t hexlen)
    {
       size_t i;
       size_t j;
       unsigned char h;
       unsigned char hh[8];
       static const unsigned char T[256] = {
       // 0-255 shuffled in any (random) order suffices
        98,  6, 85,150, 36, 23,112,164,135,207,169,  5, 26, 64,165,219, //  1
        61, 20, 68, 89,130, 63, 52,102, 24,229,132,245, 80,216,195,115, //  2
        90,168,156,203,177,120,  2,190,188,  7,100,185,174,243,162, 10, //  3
       237, 18,253,225,  8,208,172,244,255,126,101, 79,145,235,228,121, //  4
       123,251, 67,250,161,  0,107, 97,241,111,181, 82,249, 33, 69, 55, //  5
        59,153, 29,  9,213,167, 84, 93, 30, 46, 94, 75,151,114, 73,222, //  6
       197, 96,210, 45, 16,227,248,202, 51,152,252,125, 81,206,215,186, //  7
        39,158,178,187,131,136,  1, 49, 50, 17,141, 91, 47,129, 60, 99, //  8
       154, 35, 86,171,105, 34, 38,200,147, 58, 77,118,173,246, 76,254, //  9
       133,232,196,144,198,124, 53,  4,108, 74,223,234,134,230,157,139, // 10
       189,205,199,128,176, 19,211,236,127,192,231, 70,233, 88,146, 44, // 11
       183,201, 22, 83, 13,214,116,109,159, 32, 95,226,140,220, 57, 12, // 12
       221, 31,209,182,143, 92,149,184,148, 62,113, 65, 37, 27,106,166, // 13
         3, 14,204, 72, 21, 41, 56, 66, 28,193, 40,217, 25, 54,179,117, // 14
       238, 87,240,155,180,170,242,212,191,163, 78,218,137,194,175,110, // 15
        43,119,224, 71,122,142, 42,160,104, 48,247,103, 15, 11,138,239  // 16
       };

       for (j = 0; j < 8; ++j) {
          // Start each pass from the first byte offset by j. An empty input
          // is treated as if its first byte were 0, so x[0] is never read
          // when len == 0.
          h = T[((len > 0 ? x[0] : 0) + j) % 256];
          for (i = 1; i < len; ++i) {
             h = T[h ^ x[i]];
          }
          hh[j] = h;
       }

       // hex needs room for 16 characters plus the terminating NUL.
       snprintf(hex, hexlen, "%02X%02X%02X%02X%02X%02X%02X%02X",
          hh[0], hh[1], hh[2], hh[3], hh[4], hh[5], hh[6], hh[7]);
    }

The scheme used above is a very straightforward implementation of the algorithm, with a simple extension to generate a hash longer than 8 bits. That extension comprises the outer loop (i.e. all statement lines that include the variable j) and the array hh.

For a given string or chunk of data, Pearson's original algorithm produces only an 8-bit byte or integer, 0–255. However, the algorithm makes it extremely easy to generate a hash of whatever length is desired. As Pearson noted, a change to any bit in the string causes his algorithm to create a completely different hash (0-255). In the code above, following every completion of the inner loop, the first byte of the string is effectively incremented by one (without modifying the string itself).

Every time that simple change to the first byte of the data is made, a different Pearson hash, h, is generated. The C function builds a 16 hex character hash by concatenating a series of 8-bit Pearson hashes (collected in hh). Instead of producing a value from 0 to 255, this function generates a value from 0 to 18,446,744,073,709,551,615 (= 2^64 - 1).

This shows that Pearson's algorithm can be made to generate hashes of any desired length by concatenating a sequence of 8-bit hash values, each of which is computed simply by slightly modifying the string each time the hash function is computed. Thus the same core logic can be made to generate 32-bit or 128-bit hashes.
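
The same extension can be sketched in Python with the output width as a parameter. The function name `pearson_wide`, the seeded demonstration table, and the handling of empty input (treated as if its first byte were 0) are assumptions made for this example, not part of Pearson's description.

```python
import random


def pearson_wide(data: bytes, table, n_bytes: int = 8) -> bytes:
    """Concatenate n_bytes 8-bit Pearson hashes, offsetting the first byte by j."""
    # Assumption: an empty input is treated as if its first byte were 0.
    first = data[0] if data else 0
    out = bytearray()
    for j in range(n_bytes):
        h = table[(first + j) % 256]   # pass j starts from the offset first byte
        for byte in data[1:]:
            h = table[h ^ byte]
        out.append(h)
    return bytes(out)


# Demonstration table: any permutation of 0..255 will do.
rng = random.Random(2024)
demo_table = list(range(256))
rng.shuffle(demo_table)

for width in (4, 8, 16):               # 32-, 64- and 128-bit outputs
    digest = pearson_wide(b"hello", demo_table, width)
    print(width * 8, "bits:", digest.hex().upper())
```

Because each step maps h through a bijection, the passes start from distinct values and stay distinct, so the output bytes of one digest never repeat. The scheme therefore cannot reach every n-byte value, and it yields at most 256 distinct passes.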

References

  1. ^ a b Pearson, Peter K. (June 1990), "Fast Hashing of Variable-Length Text Strings", Communications of the ACM, 33 (6): 677, doi:10.1145/78973.78978
  2. ^ Online PDF file of the CACM paper.
  3. ^ Lemire, Daniel (2012), "The universality of iterated hashing over variable-length strings", Discrete Applied Mathematics, 160 (4–5): 604–617, arXiv:1008.1715, doi:10.1016/j.dam.2011.11.009