As many readers probably know, since 2007 the US National Institute of Standards and Technology (NIST) has been running a competition to develop a hash algorithm to replace SHA-1 and the SHA-2 family of algorithms. For some reason, though, the topic has received little attention on this site, which is what brought me here. This is the first in a series of articles on hash algorithms. In the series we will go over the basics of hash functions, look at the best-known hash algorithms, dive into the SHA-3 competition and examine the algorithms contending to win it, and we will certainly test them. If possible, the Russian hashing standards will be covered as well.

About myself

Student of the Department of Information Security.

About hashing

Today almost no application of cryptography is complete without hashing.
Hash functions are functions designed to "compress" an arbitrary message or data set, usually written in the binary alphabet, into a fixed-length bit pattern called a digest (or hash value). Hash functions have many applications: in statistical experiments, in testing logic devices, in building fast search algorithms, and in checking the integrity of database records. The main requirement for a hash function is that its values be uniformly distributed when the arguments are chosen at random.
A cryptographic hash function is any hash function that is cryptographically secure, that is, it satisfies a number of requirements specific to cryptographic applications. In cryptography, hash functions are used to solve the following problems:
- building data integrity control systems during their transmission or storage,
- data source authentication.

A hash function is any easily computable function h: X -> Y such that for any message M the value h(M) = H (the digest) has a fixed bit length. Here X is the set of all messages and Y is the set of binary vectors of a fixed length.

As a rule, hash functions are built from so-called one-step compression functions y = f(x_1, x_2) of two variables, where x_1, x_2 and y are binary vectors of length m, n and n respectively, n is the digest length and m is the message block length.
To compute h(M), the message is first split into blocks of length m (if the message length is not a multiple of m, the last block is padded to full length in some special way), and then the following sequential procedure is applied to the resulting blocks M_1, M_2, ..., M_N:

H_0 = v,
H_i = f(M_i, H_{i-1}), i = 1, ..., N,
h(M) = H_N

Here v is a constant, often called the initialization vector. It is chosen for various reasons and can be a secret constant or a set of random data (a sample of the date and time, for example).
With this approach, the properties of the hash function are completely determined by the properties of the one-step compression function.
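The iterative procedure above can be sketched in C as follows. This is only an illustration under stated assumptions: the compression function toy_compress, the 8-byte block length and the 0x80 padding rule are invented for the example and are not taken from any standard, and the result is in no way secure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 8  /* m: message block length in bytes (toy value) */

/* Toy one-step compression function f(M_i, H_{i-1}). NOT secure:
   it only stands in for a real compression function. */
static uint32_t toy_compress(const uint8_t block[BLOCK_LEN], uint32_t h)
{
    for (size_t j = 0; j < BLOCK_LEN; j++) {
        h ^= block[j];
        h *= 16777619u;              /* multiply by an odd constant */
        h ^= h >> 15;
    }
    return h;
}

/* h(M) = H_N, where H_0 = v and H_i = f(M_i, H_{i-1}). */
uint32_t iterated_hash(const uint8_t *msg, size_t len, uint32_t v)
{
    uint32_t h = v;                              /* H_0 = v */
    size_t full = len / BLOCK_LEN;

    for (size_t i = 0; i < full; i++)            /* all complete blocks */
        h = toy_compress(msg + i * BLOCK_LEN, h);

    /* Last block: the remaining bytes, a 0x80 marker byte, then zeros.
       An empty tail still produces one padded block. */
    uint8_t last[BLOCK_LEN] = {0};
    size_t rest = len % BLOCK_LEN;
    memcpy(last, msg + full * BLOCK_LEN, rest);
    last[rest] = 0x80;
    return toy_compress(last, h);
}
```

Calling iterated_hash(msg, len, v) with the same message but a different v gives a different result, which shows how the initialization vector enters the computation.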

There are two important types of cryptographic hash functions: keyed and keyless. Keyed hash functions are called message authentication codes (MAC). They make it possible to guarantee, without additional means, both the authenticity of the data source and the integrity of the data in systems whose users trust each other.
Keyless hash functions are called error detection codes, or modification detection codes (MDC). They make it possible, with the help of additional means (encryption, for example), to guarantee the integrity of the data. These hash functions can be used in systems with both trusting and non-trusting users.

About statistical properties and requirements

As noted above, the main requirement for hash functions is that their values be uniformly distributed when the arguments are chosen at random. For cryptographic hash functions it is also important that the slightest change in the argument changes the value of the function significantly. This is called the avalanche effect.
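The avalanche effect can be observed with a small experiment: flip one input bit and count how many output bits change. The sketch below uses the 32-bit integer mixing step from MurmurHash3 as a stand-in. That is an assumption made for brevity; it is not a cryptographic hash and the function names are hypothetical.

```c
#include <stdint.h>

/* Integer mixing step (the 32-bit finalizer used in MurmurHash3).
   Not a cryptographic hash; it only illustrates bit diffusion. */
uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Number of bit positions in which a and b differ. */
int bit_diff(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int count = 0;
    while (x) {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
}

/* Output bits that change when input bit `bit` (0..31) is flipped. */
int avalanche(uint32_t input, int bit)
{
    return bit_diff(mix32(input), mix32(input ^ (UINT32_C(1) << bit)));
}
```

For a function with good diffusion, avalanche(x, bit) averaged over many inputs should come out close to half of the output bits, here about 16 of 32.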

Keyed hash functions must satisfy the following requirements:
- impossibility of fabrication,
- impossibility of modification.

The first requirement means that it must be very hard to produce a message together with a correct digest value. The second means that, given a message with a known digest value, it must be very hard to find another message with a correct digest value.

The requirements for keyless functions are:
- one-wayness,
- collision resistance,
- second-preimage resistance.

One-wayness means that it is very hard to find a message from a given digest value. It should be noted that at present no hash function in use has been proven to be one-way.
Collision resistance means that it is hard to find a pair of messages with the same digest value. Usually, the discovery by cryptanalysts of a way to construct collisions is the first sign that an algorithm is becoming obsolete and needs to be replaced quickly.
Second-preimage resistance means that, given a message with a known digest value, it is hard to find a second message with the same digest value.

This was the theoretical part, which will be useful to us in the future ...

About popular hash algorithms

The CRC16/32 algorithms: checksums (not a cryptographic transformation).

The MD2/4/5/6 algorithms. They were created by Ron Rivest, one of the authors of the RSA algorithm.
MD5 was once very popular, but the first signs that it could be broken appeared in the nineties, and its popularity is now declining rapidly.
MD6 is a very interesting algorithm from a design point of view. It was submitted to the SHA-3 competition, but unfortunately its authors did not manage to bring it up to the required standard, and it is not on the list of candidates that advanced to the second round.

The SHA family. These algorithms are widely used today, and an active transition from SHA-1 to the SHA-2 standards is under way. SHA-2 is the collective name for the SHA-224, SHA-256, SHA-384 and SHA-512 algorithms. SHA-224 and SHA-384 are essentially analogues of SHA-256 and SHA-512 respectively, except that part of the digest is discarded after it has been computed. They should be used only to ensure compatibility with older equipment.

Russian standard - GOST 34.11-94.

In the next article

Overview of MD algorithms (MD4, MD5, MD6).


Hashing is a special method of addressing data (a kind of scattering algorithm) by their unique keys, used to find the required information quickly.

Basic concepts

Hash table

A hash table is an ordinary array with special addressing defined by some function (the hash function).

Hash function

A function that converts the key of a data item into an index in a table (the hash table) is called a hash function:

i = h(key);

where key is the key being converted and i is the resulting table index, i.e. the key is mapped to a set of, for example, integers (hash addresses), which are later used to access the data.

Hashing is thus a technique that uses the value of a key to determine its position in a special table.

However, a hash function can give the same position i in the hash table for several unique key values. The situation in which two or more keys receive the same index (hash address) is called a collision. A hashing scheme must therefore include a collision resolution algorithm that defines what to do when position i = h(key) is already occupied by an entry with a different key.

There are many hashing schemes, which differ in the hash function h(key) they use and in their collision resolution algorithms.

The most common way to define a hash function is the division method.

The input is an integer key key and the table size m. The result of the function is the remainder of dividing the key by the table size. The general form of such a function in the C/C++ programming language:

int h(int key, int m) { return key % m; }

For m = 10 the hash function returns the least significant digit of the key.

For m = 100 the hash function returns the two least significant digits of the key.

In these examples the hash function i = h(key) only determines the position from which to start searching for the entry with key key (or at which to initially place it in the table). After that, some hashing scheme (algorithm) has to be applied.

Hashing schemes

In most problems two or more keys hash to the same value, but they cannot occupy the same cell of the hash table. There are two options: either find a different position for the new key, or create a separate list for each hash table index that holds all keys mapped to that index.

These options are the two classic hashing schemes:

    open addressing with linear probing;

    chaining with separate lists (chain hashing, also called multidimensional hashing).

Open addressing with linear probing. Initially all cells of the hash table, which is an ordinary one-dimensional array, are marked as unoccupied. When a new key is added, the cell given by the hash function is checked. If it is occupied, the algorithm searches onward in a circle until it finds an empty cell (an "open address").

That is, elements whose keys hash to the same index are placed near that index.

Later, during a search, the position i in the table is first found from the key, and if the key stored there does not match, the search continues according to the collision resolution algorithm, starting from position i.

Chaining is the dominant strategy. In this case the value i = h(key) obtained from the chosen hash function is treated as an index into a hash table of lists: the key of each new entry is first mapped to position i = h(key) of the table. If the position is free, the element with that key is placed in it; if it is occupied, the collision resolution algorithm runs, and such keys are placed in a list that starts at the i-th cell of the hash table.

As a result, we have a table in the form of an array of linked lists or trees.

Populating (and reading) a hash table is simple, but accessing an element requires the following operations:

index calculation i;

search in the corresponding chain.

To speed up searching, a new element can be inserted not at the end of the list but in sorted order, i.e. in its proper place.

An example implementation of open addressing with linear probing. The input data are 7 records (for simplicity, the information part consists only of an integer) of the following structure type:

struct zap {
    int key;  // Key
    int info; // Information
};

The records (key, info) are (59,1), (70,3), (96,5), (81,7), (13,8), (41,2), (79,9); the hash table size is m = 10.

The hash function is i = h(data) = data.key % 10, i.e. i is the remainder after dividing the key by 10.

Based on the initial data, we sequentially fill in the hash table.

Hashing the first five keys yields different indexes (hash addresses): 9, 0, 6, 1 and 3.

The first collision occurs between keys 81 and 41: the cell with index 1 is occupied. We therefore scan the hash table for the nearest free cell, which in this case is i = 2.

The next key, 79, also causes a collision: position 9 is already occupied. The efficiency of the algorithm drops sharply, because it took 6 probes (comparisons) to find a free cell; the free index turned out to be i = 4.

In total, this method needs from 1 to n-1 probes per element, where n is the size of the hash table.
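The worked example can be reproduced with a short C sketch. The names struct table, oa_insert and oa_find are hypothetical, the used[] flag array is one possible way of marking unoccupied cells, and keys are assumed to be non-negative.

```c
#include <stdbool.h>

#define M 10                 /* hash table size from the example */

struct zap {
    int key;                 /* Key */
    int info;                /* Information */
};

struct table {
    struct zap cell[M];
    bool used[M];            /* false = cell is unoccupied */
};

/* Division method: remainder of the key divided by the table size.
   Assumes key >= 0. */
static int h(int key, int m) { return key % m; }

/* Insert with linear probing. Returns the index used, or -1 if the
   table is full. *probes (if not NULL) gets the number of cells examined. */
int oa_insert(struct table *t, struct zap rec, int *probes)
{
    int start = h(rec.key, M);
    for (int step = 0; step < M; step++) {
        int i = (start + step) % M;      /* wrap around "in a circle" */
        if (!t->used[i]) {
            t->cell[i] = rec;
            t->used[i] = true;
            if (probes) *probes = step + 1;
            return i;
        }
    }
    if (probes) *probes = M;
    return -1;
}

/* Search from position h(key) along the same probe sequence. */
int oa_find(const struct table *t, int key)
{
    int start = h(key, M);
    for (int step = 0; step < M; step++) {
        int i = (start + step) % M;
        if (!t->used[i]) return -1;      /* empty cell: key is absent */
        if (t->cell[i].key == key) return i;
    }
    return -1;
}
```

Inserting the seven records in the given order places key 41 at index 2 and key 79 at index 4, the latter after 6 probes, as in the description above. Note that this simple search stops at the first empty cell, so it does not support deletion without extra bookkeeping.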

Implementation of the chaining method for the previous example. We declare a structure type for an element of a (singly linked) list:

struct zap {
    int key;          // Key
    int info;         // Information
    struct zap *Next; // Pointer to the next element of the list
};

Based on the initial data, we sequentially fill the hash table, adding a new element to the end of the list if the cell is already taken.

Hashing the first five keys, as in the previous case, gives different indexes (hash addresses): 9, 0, 6, 1, and 3.

When a collision occurs, the new element is added to the end of the list. Therefore, the element with key 41 is placed after the element with key 81, and the element with key 79 is placed after the element with key 59.
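A minimal C sketch of the chaining scheme for the same data follows. The function names chain_insert and chain_find are hypothetical, keys are assumed to be non-negative, and freeing the lists is omitted for brevity.

```c
#include <stdlib.h>

#define M 10                 /* hash table size from the example */

struct zap {
    int key;                 /* Key */
    int info;                /* Information */
    struct zap *Next;        /* Pointer to the next element of the list */
};

/* Append a record to the end of the chain at index key % M.
   Returns 0 on success, -1 if memory allocation fails. */
int chain_insert(struct zap *table[M], int key, int info)
{
    struct zap *node = malloc(sizeof *node);
    if (!node) return -1;
    node->key = key;
    node->info = info;
    node->Next = NULL;

    struct zap **link = &table[key % M];
    while (*link)                        /* walk to the end of the chain */
        link = &(*link)->Next;
    *link = node;
    return 0;
}

/* Index calculation followed by a search in the corresponding chain. */
struct zap *chain_find(struct zap *table[M], int key)
{
    for (struct zap *p = table[key % M]; p; p = p->Next)
        if (p->key == key) return p;
    return NULL;
}
```

After the seven records are inserted, the chain at index 1 holds 81 followed by 41, and the chain at index 9 holds 59 followed by 79.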

Individual tasks

1. Binary trees. Using a random number generator, obtain 10 values from 1 to 99 and build a binary tree.

Traverse the tree:

1.a Left-to-right traversal (Left-Root-Right): visit the left subtree first, then the root, and finally the right subtree.

(Or the reverse, right to left: Right-Root-Left.)

1.b Top-down traversal (Root-Left-Right): visit the root before the subtrees.

1.c Bottom-up traversal (Left-Right-Root): visit the root after the subtrees.
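The three traversal orders can be sketched in C as below. This is one possible starting point rather than a reference solution: it assumes the tree is a binary search tree (the task does not say how the tree is built), and the names struct node, tree_insert and traverse are hypothetical. Random number generation is left out so that the result is reproducible.

```c
#include <stdlib.h>

struct node {
    int value;
    struct node *left, *right;
};

/* Insert into a binary search tree; equal values go to the right. */
struct node *tree_insert(struct node *root, int value)
{
    if (!root) {
        struct node *n = malloc(sizeof *n);
        if (!n) return NULL;
        n->value = value;
        n->left = n->right = NULL;
        return n;
    }
    if (value < root->value)
        root->left = tree_insert(root->left, value);
    else
        root->right = tree_insert(root->right, value);
    return root;
}

enum order { LEFT_ROOT_RIGHT, ROOT_LEFT_RIGHT, LEFT_RIGHT_ROOT };

/* Append the visited values to out[] starting at position n;
   returns the new element count. */
int traverse(const struct node *t, enum order o, int *out, int n)
{
    if (!t) return n;
    if (o == ROOT_LEFT_RIGHT) out[n++] = t->value;   /* root before subtrees */
    n = traverse(t->left, o, out, n);
    if (o == LEFT_ROOT_RIGHT) out[n++] = t->value;   /* root between subtrees */
    n = traverse(t->right, o, out, n);
    if (o == LEFT_RIGHT_ROOT) out[n++] = t->value;   /* root after subtrees */
    return n;
}
```

For a binary search tree, the Left-Root-Right order visits the values in ascending order.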

Hash functions are used in a wide variety of areas of information technology. They are designed, on the one hand, to greatly simplify the exchange of data between users and the processing of files, and on the other hand, to optimize the algorithms that control access to resources. The hash function is one of the key tools for password protection of data and for the exchange of documents signed with an electronic digital signature (EDS). There are many standards by which files can be hashed, and many of them were developed by Russian specialists. What types of hash functions are there? What are the main mechanisms of their practical application?

What is it?

First, let's look at the concept of a hash function. The term usually means an algorithm that converts a certain amount of information into a shorter sequence of characters using mathematical methods. The practical importance of hash functions can be seen in many areas. For example, they can be used to check files and programs for integrity. Cryptographic hash functions are also used in encryption algorithms.

Characteristics

Let us consider the key characteristics of the algorithms under study. Among them:

  • internal algorithms that convert data of the original length into a shorter sequence of characters;
  • openness to cryptographic verification;
  • algorithms that allow the original data to be securely encrypted;
  • suitability for decryption with modest computing power.

Other important properties of a hash function include the ability to:

  • process input data of arbitrary length;
  • generate hashed blocks of fixed length;
  • distribute the output values evenly.

The algorithms under consideration are also assumed to be sensitive to the input data at the level of 1 bit. That is, if even a single letter of the source document changes, the hash value will be different.

Requirements for hash functions

Hash functions intended for practical use in a particular area must meet a number of requirements. First, the algorithm must be sensitive to changes in the internal structure of the hashed documents. For a text file, for example, the hash function must detect rearranged paragraphs and changed hyphenation. On the one hand the content of the document does not change, but on the other hand its structure does, and hashing must detect this. Second, the algorithm must transform the data in such a way that the reverse operation (turning the hash back into the original document) is impossible in practice. Third, the hash function should use algorithms that practically exclude the possibility of producing the same sequence of characters as the hash of different data, in other words, the appearance of so-called collisions. We will look at them a little later.

These requirements can be met mainly through the use of complex mathematical approaches.

Structure

Let us look at what the structure of such functions can be. As noted above, one of the main requirements for these algorithms is that they be one-way. A person who has only the hash at their disposal should be practically unable to recover the original document from it.

In what form can a hash function used for such purposes be represented? One example is H = f(T, H1), where H is the hash, T is the text and H1 is the algorithm that processes T. This function hashes T in such a way that without knowledge of H1 it is practically impossible to recover the full file.

Using Hash Functions in Practice: Downloading Files

Let us now look more closely at how hash functions are used in practice. The corresponding algorithms can be used when writing scripts that download files from Internet servers.

In most cases a checksum, which is the hash, is computed for each file. It must be the same for the object on the server and for the copy downloaded to the user's computer. If it is not, the file may fail to open or may run incorrectly.

Hash function and digital signature

Hash functions are commonly used in the exchange of digitally signed documents. The signed file is hashed so that its recipient can verify that it is genuine. Although the hash function is not formally part of the electronic key, it can be stored in the flash memory of the hardware used to sign documents, such as an eToken.

An electronic signature is an encryption of the file using public and private keys. That is, a message encrypted with the private key is attached to the source file, and the signature is verified with the public key. If the hash values of the two documents match, the recipient's file is recognized as authentic and the sender's signature as valid.

Hashing, as noted above, is not itself a component of the EDS, but it makes it possible to optimize electronic signature algorithms very effectively. For example, only the hash can be encrypted instead of the document itself. As a result, files are processed much faster, and at the same time more effective EDS protection mechanisms become possible, since the computational effort goes not into processing the source data but into ensuring the cryptographic strength of the signature. The hash function also makes it possible to sign many types of data, not just text.
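The idea of signing only the hash can be illustrated with textbook RSA and deliberately tiny numbers. Everything in this sketch is an assumption made for illustration: the key (n = 3233, e = 17, d = 2753, built from the primes 61 and 53), the toy digest (FNV-1a reduced modulo n) and the function names. It offers no security and is not how any real EDS standard is specified.

```c
#include <stdbool.h>
#include <stdint.h>

/* Textbook RSA with tiny parameters (p = 61, q = 53): illustration only. */
enum { RSA_N = 3233, RSA_E = 17, RSA_D = 2753 };

/* base^exp mod m by square-and-multiply. */
static uint32_t pow_mod(uint32_t base, uint32_t exp, uint32_t m)
{
    uint64_t result = 1, b = base % m;
    while (exp) {
        if (exp & 1u) result = result * b % m;
        b = b * b % m;
        exp >>= 1;
    }
    return (uint32_t)result;
}

/* Toy digest: FNV-1a reduced modulo RSA_N so that it fits the toy key. */
uint32_t toy_digest(const char *doc)
{
    uint32_t h = 2166136261u;
    for (; *doc; doc++) {
        h ^= (unsigned char)*doc;
        h *= 16777619u;
    }
    return h % RSA_N;
}

/* The signer applies the private exponent to the digest, not the document. */
uint32_t sign_digest(uint32_t digest)
{
    return pow_mod(digest, RSA_D, RSA_N);
}

/* The verifier recovers the digest with the public exponent and compares. */
bool verify_digest(uint32_t digest, uint32_t signature)
{
    return pow_mod(signature, RSA_E, RSA_N) == digest;
}
```

However long the document is, the expensive private-key operation in sign_digest always works on a single short value, which is the speed-up described above.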

Password Checker

Another possible application of hashing is password verification, used to control access to file resources. How can hash functions be involved in solving such problems? Very simply.

On most servers that restrict access, passwords are stored as hash values. This is quite logical: if passwords were stored as plain text, attackers who gained access to them could easily read the secret data. Computing the password from its hash, by contrast, is not easy.

How is user access checked when these algorithms are used? The password entered by the user is hashed and compared with the hash stored on the server. If the values match, the user is given the required access to resources.

Even the simplest hash function can serve as a password checking tool, but in practice IT professionals most often use complex multi-stage cryptographic algorithms. They are usually combined with secure-channel communication standards, so that attackers cannot intercept or work out the password sent from the user's computer to the server before it is checked against the stored hash.
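The check described in this section comes down to comparing two hash values. The sketch below uses 64-bit FNV-1a purely as a stand-in so that the example is self-contained; that choice and the function names are assumptions, and FNV-1a must not be used for real passwords.

```c
#include <stdint.h>

/* Stand-in for a real password hash (FNV-1a, 64 bit). NOT suitable for
   real passwords; it only keeps the sketch self-contained. */
uint64_t toy_password_hash(const char *password)
{
    uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
    for (; *password; password++) {
        h ^= (unsigned char)*password;
        h *= 1099511628211ULL;               /* FNV prime */
    }
    return h;
}

/* The server keeps only stored_hash, never the password itself. */
int password_matches(const char *entered, uint64_t stored_hash)
{
    return toy_password_hash(entered) == stored_hash;
}
```

At registration time the server would store toy_password_hash(password); at login it calls password_matches with the text the user typed.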

Hash function collisions

The theory of hash functions includes the notion of a collision. What is it? A hash collision is a situation in which two different files have the same hash code. This becomes likely if the hash value is short: the shorter it is, the higher the probability that two hashes will match.
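How a short hash forces collisions can be shown directly. The sketch below assumes a made-up 8-bit hash (FNV-1a folded to one byte, with hypothetical function names): with only 256 possible values, any 257 distinct inputs must contain two with the same hash.

```c
#include <stdint.h>

/* Toy 8-bit hash: FNV-1a over the 4 bytes of x, folded to 8 bits.
   The tiny output length is the point of the example. */
uint8_t hash8(uint32_t x)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; i++) {
        h ^= (x >> (8 * i)) & 0xFFu;
        h *= 16777619u;
    }
    return (uint8_t)(h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24));
}

/* Try inputs 0, 1, 2, ... until two of them share a hash value.
   With 256 possible outputs this must happen within 257 inputs.
   Returns 1 and stores the colliding pair in *a and *b. */
int find_collision(uint32_t *a, uint32_t *b)
{
    int seen[256];               /* seen[v] = input that produced v, or -1 */
    for (int v = 0; v < 256; v++) seen[v] = -1;

    for (uint32_t x = 0; x <= 256; x++) {
        uint8_t v = hash8(x);
        if (seen[v] >= 0) {
            *a = (uint32_t)seen[v];
            *b = x;
            return 1;
        }
        seen[v] = (int)x;
    }
    return 0;                    /* not reachable: pigeonhole principle */
}
```

The same argument applies to any fixed output length; longer hashes only make the search impractically long.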

To avoid collisions it is recommended, in particular, to use a two-stage algorithm called "hashing the hash function", which involves generating an open and a closed code. When solving critical problems, many programmers recommend not using hash functions where they are not needed, and always testing the chosen algorithms for how well they work with the particular keys.

History of appearance

Carter, Wegman, Simonson and Bierbrouer can be regarded as the founders of the theory of hash functions. In the first versions, the algorithms were used to generate unique images of character sequences of arbitrary length, which were then used to identify the sequences and verify their authenticity. According to the specified criteria, the hash had to be 30-512 bits long. A particularly useful property of these functions was considered to be their suitability for quickly searching for files or sorting them.

Popular hashing standards

Let us now look at the popular standards in which hash functions are embodied. One of them is CRC. This algorithm is a cyclic code, also called a checksum. The standard is simple and at the same time versatile: it can be used to hash the widest range of data. CRC is one of the most common non-cryptographic algorithms.
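To show how simple CRC is, here is a bit-by-bit sketch of the common CRC-32 variant (reflected polynomial 0xEDB88320, the one used by zip and PNG). The function name crc32_bitwise is hypothetical; practical implementations usually use a lookup table for speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: initial value 0xFFFFFFFF, reflected polynomial
   0xEDB88320, final bitwise inversion. */
uint32_t crc32_bitwise(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            /* shift right; XOR in the polynomial if the low bit was set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

For the standard test string "123456789" this variant yields 0xCBF43926. Because the computation is linear, it is easy to construct different data with the same CRC, which is why CRC is not cryptographic.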

The MD4 and MD5 standards, in turn, are widely used in cryptography. Another popular cryptographic algorithm is SHA-1. Its hash size is 160 bits, which is larger than that of MD5, which produces 128 bits. There are Russian standards that regulate the use of hash functions: GOST R 34.11-94 and GOST R 34.11-2012, which replaced it. The hash value produced by the algorithms adopted in the Russian Federation is 256 bits long (GOST R 34.11-2012 also defines a 512-bit variant).

These standards can be classified in various ways. For example, some are based on block algorithms and others on specialized algorithms. The computational simplicity of standards of the first type is often accompanied by low speed. As an alternative to block algorithms, therefore, algorithms that require fewer computational operations can be used. The high-speed standards usually include, in particular, the above-mentioned MD4, MD5 and SHA. Let us look at the specifics of special hashing algorithms in more detail, using SHA as an example.

Features of the SHA algorithm

Hash functions based on the SHA standard are most often used in developing DSA digital signature tools. As noted above, the SHA algorithm produces a 160-bit hash (a so-called "digest" of a sequence of characters). The standard first divides the data into blocks of 512 bits. The message is padded with a 1 bit followed by the required number of zeros, and a code recording the length of the message is written at the end of the last block. The algorithm uses 80 logical functions, each of which processes three 32-bit words. The SHA standard also uses 4 constants.
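The padding step can be sketched in C for messages whose length is a whole number of bytes. The function names are hypothetical; the layout (a 0x80 byte, zero bytes, then the message length in bits as a 64-bit big-endian number) follows the SHA-1 padding rule. Only the padding is shown, not the hash computation itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length in bytes of the padded message: a multiple of 64 (512 bits)
   with room for the 0x80 marker byte and the 8-byte length field. */
size_t sha1_padded_len(size_t len)
{
    return ((len + 1 + 8 + 63) / 64) * 64;
}

/* Write msg plus padding into out, which must hold at least
   sha1_padded_len(len) bytes. */
void sha1_pad(const uint8_t *msg, size_t len, uint8_t *out)
{
    size_t total = sha1_padded_len(len);
    uint64_t bits = (uint64_t)len * 8;

    memcpy(out, msg, len);
    out[len] = 0x80;                              /* a single 1 bit, then zeros */
    memset(out + len + 1, 0, total - len - 1 - 8);
    for (int i = 0; i < 8; i++)                   /* 64-bit big-endian bit length */
        out[total - 1 - i] = (uint8_t)(bits >> (8 * i));
}
```

A 3-byte message is padded to one 64-byte block whose last byte is 24 (the length in bits), while a 56-byte message already needs two blocks because the marker byte and the length field no longer fit into the first one.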

Comparison of hashing algorithms

Let us look at how the properties of hash functions from different standards compare, using the Russian standard GOST R 34.11-94 and the American SHA examined above as examples. First of all, the algorithm developed in the Russian Federation performs 4 encryption operations per cycle, which corresponds to 128 rounds. With SHA, by contrast, about 20 instructions are computed in each round, and there are 80 rounds in total. Thus SHA processes 512 bits of input data in one cycle, while the Russian standard processes 256 bits of data per cycle.

Specifics of the latest Russian algorithm

We noted above that the GOST R 34.11-94 standard has been replaced by a newer one, GOST R 34.11-2012 "Stribog". Let us look at its specifics in more detail.

Like the algorithms discussed above, this standard can be used to implement cryptographic hash functions. The latest Russian standard works with an input data block of 512 bits. The main advantages of GOST R 34.11-2012 are:

  • a high level of resistance to cryptanalysis;
  • reliability, backed by the use of proven designs;
  • fast computation of the hash function, with no transformations in the algorithm that complicate the construction of the function and slow down the computation.

These advantages of the new Russian cryptographic standard allow it to be used in document workflows that must meet the most stringent criteria laid down in regulatory legislation.

Specificity of cryptographic hash functions

Let us look more closely at how the algorithms we are studying can be used in cryptography. The key requirement for such functions is the collision resistance mentioned above: it must be practically infeasible to find different inputs that produce the same hash value. Cryptographic functions must also meet the other criteria noted above. Of course, there is always some theoretical possibility of recovering the source file from a hash, especially if powerful computing resources are available. Strong algorithms are supposed to minimize the likelihood of this scenario. Thus a hash function with an n-bit output is considered hard to break if finding a collision requires on the order of 2^(n/2) operations.

Another important criterion for a cryptographic algorithm is that the hash changes whenever the input data is altered. We noted above that hashing standards should be sensitive at the level of 1 bit. This property is a key factor in ensuring reliable password protection of access to files.

Iterative schemes

Let us now look at how cryptographic hashing algorithms can be built. One of the most common approaches is the iterative sequential model. It is based on a so-called compression function, whose input has significantly more bits than its output.

Of course, the compression function must meet the necessary criteria of cryptographic strength. In the iterative scheme, the input data stream is first divided into blocks whose size is measured in bits. The algorithm also uses temporary variables of a given bit length. A well-known number is used as the first value, and each subsequent block of data is combined with the previous output of the compression function. The hash value is the output of the last iteration, which depends on the entire input stream, including the first value. This provides the so-called "avalanche effect" of hashing.

The main difficulty with hashing implemented as an iterative scheme is that the hash function is hard to construct when the length of the input stream is not a multiple of the block size. In that case, however, the hashing standard can specify algorithms that extend the original stream in one way or another.

In some cases so-called multi-pass algorithms can be used to process data within an iterative scheme. They produce an even stronger "avalanche effect". In this scenario the data is first repeated, and only then extended.

Block Algorithm

The compression function can also be based on a block cipher. To increase security, the block of data being hashed at the current iteration can be used as the key, and the result of the previous application of the compression function as the input. The last iteration then yields the output of the algorithm. The security of the hash depends on the strength of the cipher used.
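A sketch of this construction in C, using the XTEA block cipher (64-bit block, 128-bit key) as the underlying algorithm. The choice of XTEA, the initial state and the function names are assumptions made for illustration, and a 64-bit state is far too short for real use. Practical designs such as Davies-Meyer additionally XOR the previous state into the cipher output; the sketch omits that step to stay with the scheme as described.

```c
#include <stdint.h>

/* XTEA block cipher: 64-bit block v, 128-bit key, 32 cycles. */
static void xtea_encrypt(uint32_t v[2], const uint32_t key[4])
{
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    const uint32_t delta = 0x9E3779B9u;
    for (int i = 0; i < 32; i++) {
        v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
        sum += delta;
        v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* One compression step: the 128-bit message block is the key, the
   previous 64-bit state is the input, the ciphertext is the new state. */
void compress_step(uint32_t state[2], const uint32_t block[4])
{
    xtea_encrypt(state, block);
}

/* Hash nblocks 128-bit blocks, starting from a fixed initial state.
   Padding of the message into whole blocks is left to the caller. */
void block_hash(const uint32_t blocks[][4], int nblocks, uint32_t out[2])
{
    out[0] = 0x01234567u;        /* arbitrary initial value (assumption) */
    out[1] = 0x89ABCDEFu;
    for (int i = 0; i < nblocks; i++)
        compress_step(out, blocks[i]);
}
```

The output of the last call to compress_step is the hash, and every block influences it through the cipher key.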

However, as noted above in the discussion of the different kinds of hash functions, block algorithms often require considerable computing power. If it is not available, files may not be processed fast enough to solve practical problems involving hash functions. At the same time, the required cryptographic strength can also be achieved with a small number of operations on the source data, and the algorithms we have considered, in particular MD5, SHA and the Russian cryptographic standards, are designed for such problems.

What is a hash? A hash function is a mathematical transformation of information into a short string of a fixed length.

Why is this needed? Hash function analysis is often used to check the integrity of important operating system files, important programs and important data. Monitoring can be carried out both as needed and on a regular basis.

How is it done? First, decide which files need to have their integrity monitored. For each file, the hash value is calculated with a special algorithm and the result is saved. After the required time, the same calculation is repeated and the results are compared. If the values differ, the information contained in the file has been changed.
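The procedure just described (compute, save, recompute, compare) can be sketched in C. The digest here is 64-bit FNV-1a, chosen only so that the example needs no external library; this is an assumption, and a real monitor would use a cryptographic hash such as SHA-256 or a GOST hash from a crypto library. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in digest: 64-bit FNV-1a over the file contents (not cryptographic).
   Returns 0 on success, -1 if the file cannot be read. */
int file_digest(const char *path, uint64_t *digest)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    uint64_t h = 14695981039346656037ULL;        /* FNV offset basis */
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;               /* FNV prime */
        }
    int failed = ferror(f);
    fclose(f);
    if (failed) return -1;
    *digest = h;
    return 0;
}

/* Compare the current digest with a previously saved one.
   1 = unchanged, 0 = changed, -1 = file could not be read. */
int file_unchanged(const char *path, uint64_t saved)
{
    uint64_t now;
    if (file_digest(path, &now) != 0) return -1;
    return now == saved;
}
```

The saved values themselves have to be kept somewhere an intruder cannot modify them, otherwise the comparison proves nothing.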

What characteristics should a hash function have?

  • must be able to perform transformations of data of arbitrary length into a fixed one;
  • must have an open algorithm so that its cryptographic strength can be investigated;
  • should be one-way, that is, there should be no practical mathematical way to determine the input data from the result;
  • should "resist" collisions, that is, it should be practically infeasible to find different inputs that produce the same value;
  • should not require large computing resources;
  • with the slightest change in the input data, the result should change significantly.

What are the popular hashing algorithms? The following hash functions are currently in use:

  • CRC is a cyclic redundancy code, or checksum. The algorithm is very simple and has many variations depending on the required output length. It is not cryptographic!
  • MD5 is a very popular algorithm. Like its predecessor MD4, it is a cryptographic function. The hash size is 128 bits.
  • SHA-1 is also a very popular cryptographic function. The hash size is 160 bits.
  • GOST R 34.11-94 is a Russian cryptographic standard for computing a hash function. The hash size is 256 bits.

When can a system administrator use these algorithms? Often, when downloading any content, such as programs from the manufacturer's website, music, movies or other information, there is a checksum value calculated using a certain algorithm. For security reasons, after downloading, you must independently calculate the hash function and compare the value with what is indicated on the site or in the attachment to the file. Have you ever done this?

What is more convenient to calculate the hash? Now there are a large number of such utilities, both paid and free to use. I personally liked HashTab. Firstly, during installation, the utility is embedded as a tab in the file properties, secondly, it allows you to select a large number of hashing algorithms, and thirdly, it is free for private non-commercial use.

What about Russian tools? As mentioned above, Russia has the hashing standard GOST R 34.11-94, which is widely used by many manufacturers of information security tools. One of these tools is "FIX", a program for recording and monitoring the initial state of a software package. This program is a means of monitoring the effectiveness of information security tools.

FIX (version 2.0.1) for Windows 9x/NT/2000/XP

  • Calculation of checksums of given files using one of 5 implemented algorithms.
  • Fixation and subsequent control of the initial state of the software package.
  • Comparison of software package versions.
  • Fixing and controlling directories.
  • Control of changes in specified files (directories).
  • Generation of reports in TXT, HTML, SV formats.
  • The product has a FSTEC certificate according to NDV 3 No. 913 until June 01, 2013.

And what about the electronic digital signature (EDS)? The result of the hash function, together with the user's secret key, is fed to the input of the cryptographic algorithm, which calculates the digital signature. Strictly speaking, the hash function is not part of the EDS algorithm, but it is often included deliberately in order to rule out an attack that uses the public key.

Nowadays, many e-commerce applications let you store the user's secret key in the private area of a token (ruToken, eToken), with no technical means of extracting it from there. The token itself has a very limited amount of memory, measured in kilobytes. The document itself cannot be transferred to the token for signing, but it is very easy to transfer the hash of the document to the token and get an EDS at the output.

Hash tables

A hash table (also called a scatter table or a table with computed addresses) is a dynamic set that supports adding, searching for and deleting an element and uses special addressing methods.

The main difference between hash tables and other dynamic sets is that the address of an element is calculated from the value of its key.

The idea of hashing is that work with one large array is reduced to work with a number of small sets.

Take an address book, for example. Its pages are marked with letters, and the page marked with a given letter contains the surnames beginning with that letter. A large set of surnames is thus divided into 28 subsets. When searching, the book is opened straight at the required letter, which speeds up the search.

In programming, a hash table is a data structure that stores pairs (key or index + value) and supports three operations: adding a new pair, searching for a pair by key and deleting a pair by key.

A search in a hash table is carried out in two stages:

the first stage is computing the hash function, which converts the search key into a table address;

the second stage is resolving the conflicts that arise when processing such keys.

If the hash function generates the same address for different key values, a collision (conflict, clash) is said to occur.

Hash functions

The main purpose of a hash function is to map different keys to, as far as possible, different non-negative integers.

The fewer identical values a hash function generates, the better it is.

The hash function must be chosen in such a way that the following properties are fulfilled:

    the hash function is defined on the elements of the set and takes non-negative integer values;

    the hash function is easy to calculate;

    the hash function takes its different values with approximately equal probability (collision minimization);

    for close argument values, the hash function takes values that are far apart.

To build a good hash function, you need to know the distribution of the keys. If the key distribution is known, then ideally the density of the keys and the density of the hash values should be identical.

Let p(key) be the distribution density of requests by key. Then, ideally, the distribution density g(H(key)) of requests to the table entries should be such that the average number of elements that have to be traversed in the collision chains is minimal.

Example.

Let there be a set of keys

{0, 1, 4, 5, 6, 7, 8, 9, 15, 20, 30, 40}

and let the table have 4 entries.

We can build the hash function:

h(key) = key % 4.

The keys are then mapped to the table entries {0, 1, 2, 3} as follows:

[Table: h(key), entry number, maximum chain length, % of hits]

On average, a search has to traverse 3·0.5 + 1.5·0.25 + 0.5·0.08 + 1·0.17 ≈ 2.1 list elements.

Example with a different hash function.

[Table: h(key), entry number, % of hits]

On average, a search has to traverse 4·1.5·0.25 = 1.5 list elements.

If this is an information retrieval system, then its search performance will increase by about 25%.

Methods for constructing hash functions

Modular hashing

A simple, efficient and commonly used hashing method.

The table size is chosen to be a prime number m, and the hash function is calculated as the remainder of a division:

h(key) = key % m

key is the integer numeric value of the key,

m is the number of hash values (hash table entries).

Such a function is called modular and takes values from 0 to (m - 1).

Modular hash function in C++:

typedef int HashIndexType;
HashIndexType Hash(int key)
{ return key % m; }

Example

key = {1, 3, 56, 4, 32, 40, 23, 7, 41,13, 6,7}

Let m = 5

h(key) = {1, 3, 1, 4, 2, 0, 3, 2, 1, 3, 1, 2}

The choice of m matters. To get a distribution of keys that is close to random, m should be a prime number.
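The example above can be reproduced with a short C++ sketch. The function names modular_hash and hash_all are hypothetical, and the code assumes non-negative keys and m > 0.

```cpp
#include <vector>

// Modular hash: the remainder of dividing the key by the table size m.
// Assumes key >= 0 and m > 0, so the result lies in 0 .. m - 1.
int modular_hash(int key, int m) {
    return key % m;
}

// Applies the hash to every key, preserving the order of the keys.
std::vector<int> hash_all(const std::vector<int>& keys, int m) {
    std::vector<int> addresses;
    addresses.reserve(keys.size());
    for (int key : keys) {
        addresses.push_back(modular_hash(key, m));
    }
    return addresses;
}
```

Calling hash_all with the keys {1, 3, 56, 4, 32, 40, 23, 7, 41, 13, 6, 7} and m = 5 yields the addresses listed in the example.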

Multiplicative method

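Hash function:

h(key) = ⌊m · ((key · A) mod 1)⌋

Here 0 < A < 1 is a constant, and "mod 1" extracts the fractional part of a number, just as "mod 5" gives the remainder after division by 5:

12 mod 5 = 2 (remainder after dividing 12 by 5).

5.04 mod 1 = 0.04 (the fractional part is extracted).

Example

key = 123456

m = 10000

A = 0.6180339887499 = 0.618…

h(key) = ⌊10000 · ((123456 · 0.618…) mod 1)⌋ = ⌊10000 · 0.0041151…⌋ = ⌊41.151…⌋ = 41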
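A minimal C++ sketch of the multiplicative method, assuming the usual form of the formula, h(key) = floor(m · frac(key · A)), where frac is the fractional part ("mod 1"). The function name multiplicative_hash and the default value of A are taken from the example and are otherwise an assumption.

```cpp
#include <cmath>

// Multiplicative hash: floor(m * frac(key * A)), with 0 < A < 1.
// Assumes key >= 0 and m > 0; floating-point rounding can matter for
// very large keys, so this is an illustration rather than production code.
int multiplicative_hash(long key, int m, double A = 0.6180339887499) {
    double product = static_cast<double>(key) * A;
    double fraction = product - std::floor(product);  // the "mod 1" step
    return static_cast<int>(std::floor(m * fraction));
}
```

With key = 123456 and m = 10000 the product key · A is about 76300.0041151, its fractional part is about 0.0041151, and the resulting address is 41.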

Additive method

Used for strings of variable length (the table size m is 256).

typedef unsigned char HashIndexType;
HashIndexType Hash(char *str)
{ HashIndexType h = 0;
while (*str)
h += *str++;
return h; }

The disadvantage of the additive method is that similar words and anagrams are not distinguished, i.e. h(XY) = h(YX).

A variant of the additive method, where the key is a character string. In the hash function, the string is converted to an integer by summing all its characters, and the remainder after dividing by m is returned (usually the table size is m = 256).

int h(char *key, int m)
{ int s = 0;
while (*key) s += *key++;
return s % m; }

Collisions arise for strings that consist of the same set of characters, for example abc and cab.

This method can be modified slightly by summing only the first and last characters of the key string.

int h(char *key, int m)
{ int len = strlen(key), s = 0;
if (len < 2) // if the key length is 0 or 1,
s = key[0]; // use key[0]
else s = key[0] + key[len - 1];
return s % m; }

In this case, collisions arise for strings such as abc and amc, which share their first and last characters.
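The collisions mentioned for the additive method are easy to check with a small C++ sketch. The names additive_hash and first_last_hash are hypothetical; the code uses std::string instead of char pointers and treats characters as unsigned values.

```cpp
#include <string>

// Additive hash: sum of all character codes, reduced modulo m.
int additive_hash(const std::string& key, int m) {
    int sum = 0;
    for (unsigned char c : key) sum += c;
    return sum % m;
}

// Variant that sums only the first and last characters of the key.
// An empty key hashes to 0; a one-character key uses that character alone.
int first_last_hash(const std::string& key, int m) {
    if (key.empty()) return 0;
    int sum = static_cast<unsigned char>(key.front());
    if (key.size() >= 2) sum += static_cast<unsigned char>(key.back());
    return sum % m;
}
```

With m = 256, additive_hash gives the same address for "abc" and "cab", while first_last_hash separates these two strings but maps "abc" and "amc" to the same address.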

With chaining, the hash function takes a key and uses it to calculate an address in the table (the address can be an index in the array to which the chains are attached). For example, it can produce the number 3 from the string "abcd" and the number 7 from the string "efgh". The first structure of the chain is then taken from hash[3] or hash[7], and the search continues along the chain until "abcd" is found in the chain at hash[3] or "efgh" is found in the chain at hash[7]. When the structure with "abcd" is found, the rest of its data is taken and returned, or the whole structure (its address) is returned so that the rest of the data can be read from it. The chains exist because many different keys have the same address in the table: for example, the hash function can return 3 for "abcd" and also 3 for "zxf9", so both are linked into the chain that hangs on the third index of the array.
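A hash table with chains of this kind can be sketched in C++ as follows. This is an illustration, not a reference implementation: the class name ChainedTable, the int value type and the simple additive hash inside index are all assumptions made for the example.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hash table with separate chaining: each array slot holds a list of pairs.
class ChainedTable {
public:
    // Assumes buckets > 0.
    explicit ChainedTable(std::size_t buckets) : buckets_(buckets) {}

    // Address of a key: additive hash reduced to the number of slots.
    std::size_t index(const std::string& key) const {
        std::size_t sum = 0;
        for (unsigned char c : key) sum += c;
        return sum % buckets_.size();
    }

    // Adds a pair, or overwrites the value if the key is already present.
    void insert(const std::string& key, int value) {
        auto& chain = buckets_[index(key)];
        for (auto& entry : chain) {
            if (entry.first == key) { entry.second = value; return; }
        }
        chain.emplace_back(key, value);
    }

    // Walks the chain at the key's address; returns nullptr if not found.
    const int* find(const std::string& key) const {
        for (const auto& entry : buckets_[index(key)]) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

    // Removes the pair with the given key; returns false if it is absent.
    bool erase(const std::string& key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == key) { chain.erase(it); return true; }
        }
        return false;
    }

private:
    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

Keys that collide, such as "abc" and "cab" under this hash, simply end up in the same chain and are told apart by comparing the stored keys.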

With open addressing, the array H stores the key-value pairs themselves. The element insertion algorithm checks the cells of the array H in some order until the first free cell is found, and the new element is written into that cell.

The search algorithm searches the cells of the hash table in the same order as when inserting, until either an element with the desired key or a free cell is found (which means there is no element in the hash table).
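The insertion and search procedures just described can be sketched in C++ as follows. The probing order is an assumption: the text only says "in some order", and this example uses linear probing (address, address + 1, and so on, wrapping around). The class name ProbingTable is hypothetical, keys are assumed to be non-negative, and deletion is left out because it needs extra bookkeeping.

```cpp
#include <cstddef>
#include <vector>

// Open addressing with linear probing; the pairs live in the array itself.
class ProbingTable {
    struct Cell {
        bool used = false;
        int key = 0;
        int value = 0;
    };

public:
    // Assumes capacity > 0.
    explicit ProbingTable(std::size_t capacity) : cells_(capacity) {}

    // Checks cells in probing order until a free cell (or the same key)
    // is found. Returns false when the table is full.
    bool insert(int key, int value) {
        for (std::size_t i = 0; i < cells_.size(); ++i) {
            Cell& cell = cells_[probe(key, i)];
            if (!cell.used || cell.key == key) {
                cell.used = true;
                cell.key = key;
                cell.value = value;
                return true;
            }
        }
        return false;
    }

    // Follows the same order as insert; a free cell means the key is absent.
    const int* find(int key) const {
        for (std::size_t i = 0; i < cells_.size(); ++i) {
            const Cell& cell = cells_[probe(key, i)];
            if (!cell.used) return nullptr;
            if (cell.key == key) return &cell.value;
        }
        return nullptr;
    }

private:
    // Address for the given attempt number (0 is the home cell).
    std::size_t probe(int key, std::size_t attempt) const {
        return (static_cast<std::size_t>(key) + attempt) % cells_.size();
    }

    std::vector<Cell> cells_;
};
```

In a table of 5 cells the keys 1, 6 and 11 all have home cell 1, so they occupy cells 1, 2 and 3, and a search for 6 inspects cell 1 before finding the key in cell 2.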

XOR

Used for variable-length strings. The method is similar to the additive method, but it distinguishes similar words. The "exclusive OR" operation is applied sequentially to the elements of the string.

typedef unsigned char HashIndexType;
unsigned char Rand8[256];

HashIndexType Hash(char *str)
{ unsigned char h = 0;
while (*str) h = Rand8[h ^ (unsigned char)*str++];
return h; }

Here Rand8 is a table of 256 eight-bit random numbers.

For a table size <= 65536:

typedef unsigned short int HashIndexType;
unsigned char Rand8[256];

HashIndexType Hash(char *str)
{ HashIndexType h; unsigned char h1, h2;
if (*str == 0) return 0;
h1 = *str; h2 = *str + 1; str++;
while (*str)
{ h1 = Rand8[h1 ^ (unsigned char)*str]; h2 = Rand8[h2 ^ (unsigned char)*str];
str++; }
h = ((HashIndexType)h1 << 8) | (HashIndexType)h2;
return h % HashTableSize; }
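To see how the XOR method separates anagrams, here is a small self-contained C++ sketch of the 8-bit variant. The table is an assumption: instead of truly random numbers it is filled by the affine map (167 · i + 13) mod 256, which is a permutation of 0..255 and keeps the example reproducible. The names make_table, xor_hash and additive_hash8 are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Stand-in for the table of 256 random bytes (assumption for this example):
// an affine map with an odd multiplier is a permutation of 0..255.
std::array<std::uint8_t, 256> make_table() {
    std::array<std::uint8_t, 256> table{};
    for (int i = 0; i < 256; ++i) {
        table[i] = static_cast<std::uint8_t>((i * 167 + 13) & 0xFF);
    }
    return table;
}

// XOR method: mix each character into the state through a table lookup.
std::uint8_t xor_hash(const std::string& s,
                      const std::array<std::uint8_t, 256>& table) {
    std::uint8_t h = 0;
    for (unsigned char c : s) {
        h = table[h ^ c];
    }
    return h;
}

// Additive method with an 8-bit result, for comparison.
std::uint8_t additive_hash8(const std::string& s) {
    std::uint8_t h = 0;
    for (unsigned char c : s) {
        h = static_cast<std::uint8_t>(h + c);
    }
    return h;
}
```

With this table, "ab" and "ba" receive different addresses from xor_hash, while additive_hash8 returns the same value for both.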

Universal hashing

Universal hashing means that the hash function is chosen at random from some set of functions while the program is running.

If, in the multiplicative method, a sequence of random values is used as A instead of a fixed number, the result is a universal hash function.

However, generating random numbers would take too much time.

Pseudo-random numbers can be used instead.

// pseudo-random number generator
typedef int HashIndexType;
HashIndexType Hash(char *v, int m)
{ int h, a = 31415, b = 27183;
for (h = 0; *v != 0; v++, a = a*b % (m - 1))
h = (a*h + *v) % m;
return (h < 0) ? (h + m) : h; }