The same data should always produce the same hash, and different data a different hash, but because the hash value is very short compared to the data being hashed, it's possible for different data to yield the same hash value. This problem is called a hash collision. However, in a typical hash of over a hundred bits in length, the chance of a collision is astronomically small, like picking atoms completely at random out of the entire universe and happening to pick the same one twice.
In an ideal hashing function there is no predictable correspondence between the initial data and the output data such as two similar inputs resulting in closer outputs, which means that a hash of a piece of data can act as a kind of finger print that identifies it specifically from all other data, and there is no way of obtaining the original data or anything about it from its hash value.
In reality though cryptographic weaknesses have been found in different hashing algorithms as available processing power increases which allow attackers to predict certain things about the original content from a hash value, or to deliberately create collisions. For this reason hash functions are continually improving, for example MD5 is now only used for checksums or non-critical applications, and the SHA group are used for more security-conscious applications.
Hashing is used in practically every computing application and plays especially important roles in the domain of crypto-currencies which are described below. If you're learning about Bitcoin or other crypto-currencies but are not familiar with hashing, hash-based verification methods or proof-of-work systems, you should read this before starting the Bitcoin Whitepaper.
Hashes allow data to be referred to by it's hash value which is much more economical on resources than transferring and storing the entire content. When a client needs the content they can request it and then hash it themselves to ensure they get the same result so they can be sure that it's exactly the same data as they requested. This is why download sites often supply the MD5 hash along with each downloadable item so the client can easily get the hash of their file locally and ensure it matches the one specified on the original site so they know that the file hasn't been corrupted or truncated in transit.
Another very simple but useful way of using hashes is for storing user passwords, if somebody hacks the system and obtains all the user data, all they have is hash values which they can't obtain the original password from without trying every possible combination. Actually trying every possible combination isn't that hard if you have a simple password, which is why sites always recommend you use upper and lower case, numbers and symbols in your passwords. When a user logs in, the server simply hashes the password they enter and checks if the hash matches the stored hash, if it does the server can be sure the user has entered the same value. If a user forgets their password, they'll need to create a new one, because the server cannot tell them what their password is since it only has hash values stored.
This is a very over-simplified explanation but is sufficient to yield a basic understanding of the mechanism. The idea with public-key cryptography is that a user holds a pair of keys, one is to be kept completely secret and backed up securely, the other is able to be shared with the public. The idea is that anyone who has the public key can use it to encrypt information that they know only the person with the corresponding private key can decrypt.
There are many different variations on this theme, such as PGP keys are used for encrypted email which means that people can publish their public PGP key allowing anyone to send encrypted email to them that only they can decrypt with their private PGP key. There are also RSA key-pairs which allow people to give access to services using a persons public RSA key knowing that only the holder of the corresponding private RSA key will be able to access. A bitcoin address is a public key that anyone can send coins to knowing that only the holder of the corresponding private bitcoin key can spend.
The way that public and private keys are related is very similar to how a piece of content is related to its hash - you can think of a public key as a hash of the corresponding private key. If you have the public key there's no way that you can find out anything at all about the private key, but if you have the private key you can easily create the corresponding public key.
People can use keys and hashes to digitally sign content. The holder of the private key can create a unique signature that corresponds to the hash of some content such as a statement, and then anyone with that person's public key can verify that the signature was signed by the holder of the private key. All different types of keys, no matter what their main purpose offer this signing mechanism, because the signature process is fundamental to the performing of the encryption or access mechanism.
Hash lists and trees
In many systems, the content changes over time, and in this case the hash verification process can be a hash of not only the current state of the content, but also of the hash of the previous state, which in turn contains the hash of the state before, and so on. In this way it's impossible to modify any of the previous states without creating a totally different hash in the current state. The same applies to trees (called Merkle trees, see also Vitalik's post on Merkle trees) where each node contains a hash of the child node's hashes so that any change of data of any node in the tree will cause different hash values to propagate up the tree all the way to the root node.
Another really useful kind of hash-tree is used in deterministic wallets for cryto-currencies. These are wallets that allow many different addresses, even across multiple different types of coins, to be accessed by a single seed (usually composed of a twelve word phrase). This is a protocol called BIP-32 which uses the seed as the root of a tree of addresses. It's basically the idea of appending the seed with path information just like the folder tree on your hard-drive except all the folder are numbers, e.g. "seed/0/1/1" or "seed/2/5/3/3". Taking the hash of the seed and path gives a unique seed from which to create an address for a particular asset type, and you'll always get the same address from the same seed and path - just like you always get the same hash result from the same content.
Proof of work
Hashing can be used to prove that a certain amount of computation has been done that corresponds to some piece of data which is called "proof of work". This is simply done by appending the data with an arbitrary number (called a "nonce") that satisfies a particular constraint on the resulting hash. For example the "Hashcash" system (described below) requires that an email message header appended with its nonce has twenty zeros at the start of its binary hash result. Since a hash with many zeros at the beginning is very rare, hundreds of millions of different nonce values will need to be tried before finding such a hash result.
As a concrete example, lets take the sentence above, "the quick brown fox jumps over the lazy dog", and find the first number on the end of the sentence that yields a hash starting with four zeros in the standard base sixteen result (which is like saying sixteen zeros at the start of the result if it were shown in binary). I wrote a few lines of code to quickly loop through trying each hash starting with "the quick brown fox jumps over the lazy dog1" revealing an MD5 hash value of e48c7b8bfcdc4efe97e08c43ae387524 which doesn't start with any zeros, so it tried 2,3,4 and so on... It didn't find a hash starting with four leading zeros until it got all the way up to the number 49609 on the end of the sentence which has a hash value of 0000fc25d6614bf1636577c532e86788. The only way I could find a number (my nonce) that gave me a hash with four leading zeros like this was to keep trying either randomly or counting in some way, there's no short cut, I had to do the work. But for you to prove that my number does give these zeros (my nonce is valid), you just need to get the hash of "the quick brown fox jumps over the lazy dog49609", which in Linux you can do as follows:
echo "the quick brown fox jumps over the lazy dog49609" | md5sum
Which gives you the following output:
And there's the four leading zeros showing that my nonce of 49609 is valid, and that it really is proof that I must have done this work, and done it specifically for this exact sentence and no other. I don't need to show you my script, tell you what language I used, how long it took or anything else, the nonce stands independently as proof of the work.
In this same way a piece of data can be accompanied by a proof of work in the form of a nonce, so that a client that wishes to verify that this work has really been done can simply hash the data appended with the nonce value and check if the resulting hash value satisfies the criteria, such as having twenty leading zeros in the case of Hashcash or crypto-currency networks.
Just in case you're curious, here's the one-liner I used to find my nonce which will work on any Unix-like system. You can modify this to experiment with your own nonce-searches by changing the message or the four zeros. In plain English this code says to keep increasing the value of a variable called $n while our sentence appended with $n and sent through the md5sum program does not start with 0000, then after that (i.e. it does start with 0000) print the value of $n.
perl -e '$n++ while`echo "the quick brown fox jumps over the lazy dog$n"|md5sum`!~/^0000/;print$n'
The above code is very inefficient and takes a couple of minutes to find our nonce, but is good for a quick example that doesn't require anything from the environment. Here's another version which will find the nonce in a fraction of a second, but does require that your version of Perl has MD5 support, which probably most do by default these days. This version is fast enough to find the nonce for five leading zeros (twenty binary zeros) in about two seconds (on my 1.7GHz core i5) which is 2009982.
perl -e 'use Digest::MD5 qw(md5_hex);$n++ while md5_hex("the quick brown fox jumps over the lazy dog$n\n")!~/^0000/;print $n'
Proof of work in crypto-currencies
this section is still in-progress
In crypto-currency networks like Bitcoin, the mining process collects together unconfirmed transactions (ones that are not part of any block yet) into blocks and accompanies each block with a proof-of-work nonce.
The reason this is confusing is because the phrase "working on" makes it sounds as if when a new transaction is received the whole block header would change and therefore all the work done so far to find the nonce for the current block will have been wasted.
This however is not the case because the chances of finding a valid nonce are always the same no matter how much work you've already done, to think otherwise is a logical flaw known as "the gamblers fallacy". It's like thinking that if you've previously tossed a coin and it's come up heads many times in a row, that your chances of getting heads this time are much lower than 50%, but of course it's always 50/50.
So to sum that up, it's not wasting any time when new unconfirmed transactions keep arriving while trying to find a valid nonce. When one is finally found it will contain all the unconfirmed transactions that miner has seen up to that point, it doesn't have to queue any up for the next block while it's figuring this one out.
Another thing is that I wondered why everybody counts from zero and increments when trying to find a nonce since I thought this would mean everybody is racing through the same set of hash results (at least, those that had the same set of unconfirmed transactions in the same order, would be), but actually everybody's block header is guaranteed to be different because they all contain at least a different address to which the block reward should go.
Hashcash is a proof-of-work system designed to limit email spam and denial-of-service attacks. Hashcash was proposed in May 1997 by Adam Back. This was originally an idea for how to stop spam (or more generally denial-of-service attacks). The idea was that tokens could be made by requesters of a service which require significant processing power to make (as described above), and can be validated by the recipient to ensure that the work has indeed been done and it matches the headers of the email it's within. This would make spamming impractical because every message requires at least a second of work to process.
It takes an average of about one second for 1GHz hardware to find a hash with twenty leading zeros, but this criteria of twenty needs to increase over time to maker it harder to find a valid token because processing power always increases and becomes cheaper (Moore's law).
These valid proof-of-work hashes contain inherent value within them, because they've taken time to build and prove that they belong to (i.e. are owned by) the message header they've been constructed for. The Hashcash algorithm could be used in a more general context than email by allowing them to be reused and transferred between owners. A central provider could "mint" them and keep track of the current owner of each one.
In the Bitcoin network, the coins and who owns each (and all historical transfers) are held in a distributed database called the blockchain. The bitcoin owner is in the form of a bitcoin address rather than an email address. The bitcoin address is actually a public key and only the person with the corresponding private key has the abiltity to transfer it to another address.
Bitcoin also distributes the minting process by allowing anyone to do it, and has a difficulty level (how numerically low the hash result has to be to be considered valid) that auto-adjusts to ensure that the issuance rate matches a specific pre-determined curve no matter how many people are mining, or how fast the computer can process the hashes. The difficulty automatically adjusts to keep the average time it takes to find a valid block about ten minutes.
In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value. As an Internet standard (RFC 1321), MD5 has been employed in a wide variety of security applications, and is also commonly used to check the integrity of files. An MD5 hash is typically a 32-character hexadecimal number. Recently, a number of projects have created MD5 "rainbow tables" which are easily accessible online, and can be used to reverse many MD5 strings into their original meanings.
MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function, MD4. In 1996, a flaw was found with the design of MD5; while it was not a clearly fatal weakness, cryptographers began to recommend using other algorithms, such as SHA-1. In 2004, more serious flaws were discovered making further use of the algorithm for security purposes questionable, but it's still very useful for simple data integrity purposes.
On any Unix-like OS you can simply use the following syntax examples, the first gives you the hash of the content of a file, and the second gives you the hash of the given text.
md5sum filename echo -n "some text to hash instead of a file" | md5sum
On 23 Feb 2017, Google and a team from Amsterdam announced the first SHA1 collision created using a practically reproducible technique. It is now urgent that security sensitive applications using SHA1 move to SHA256.
- Random numbers, Encryption and Hashing
- Hash function
- Hash collision
- Merkle tree (see also Vitalik's post on Merkle trees)
- Hash table
- Distributed hash table
- The Keccak sponge function family - NIST selects Keccak for SHA-3
- The Sponge Functions Corner
- How RSA works - nothing to do with hashing, but similarly important to know, also the math behind bitcoin
- Public Key Cryptography: Diffie-Hellman Key Exchange - video using paint and clocks to explain key exchange
- What is MAST? - understanding merkalised abstract syntax trees