Explanation of Cybersecurity Hashing and Collisions

This blog post is a transcript of Christian Espinosa’s explanation of cybersecurity hashing and collisions and covers the following:

What is hashing?
What is a hashing collision?
What are hashing birthday attacks?
Includes a demonstration of a 3-way MD5 collision

Check out my latest book: https://christianespinosa.com/books/the-smartest-person-in-the-room/

In Dec 2020, Alpine Security was acquired by Cerberus Sentinel (https://www.cerberussentinel.com/)

Need cybersecurity help? Connect with me: https://christianespinosa.com/cerberus-sentinel/

Complete Cybersecurity Hashing and Collisions Explanation Video Transcript

How’s it going? This is Christian Espinosa with Alpine Security. Today’s topic is on hashing and collisions. First off, what is hashing? A lot of people think hashing is encryption, but hashing is really not encryption. It is a mathematical function, but when you do encryption you typically need a key to decrypt something. With hashing, there’s no key involved. Hashing is really a one-way function. What that means is if we take some data and cram it through a hashing algorithm, out spits what’s called a message digest. The message digest is a fixed linked. The data we cram through the algorithm can be a variable linked. We can cram one byte or one terabyte through let’s say an MD5 hash algorithm and out spits a 128 bit message digest, regardless of the size of data we put through it. It’s called a one-way function because you can’t take the message digest value and go backwards through the hash algorithm and reproduce the data that was used to create the message digest. It doesn’t work that way. It’s only one way.

Hashing is used for integrity and to store passwords. Passwords should never be stored in cleartext. They should be stored in a hashed format or message digest. That way if someone types in a password on the password system, the password is hashed. If the hash value matches what’s stored, you know what the password is. There’s a little bit more to it than that, but that’s basically the concept.

Hashing is also used for integrity because if I take some data and run it through the hash algorithm, I get a message digest. If the data changes that all and I run it through the hash algorithm, I should get a different message digest, which tells me that the data has changed. If the data is the same, I’ve run it through the hash algorithm, I get the same message digest, I know I have integrity and the data has not been altered. One of the primary uses of hashing is for integrity and another use is for passwords.

There’s a lot of discussion about collisions and I’m going to do a demonstration here. Collisions are when you have two different inputs, you take it through the hash algorithm, and both different inputs produce the same message digest. That’s called a collision. Two different things produce the same thing, collision. Technically, that should not happen with hashing. If it does happen, it kind of breaks the concept of hashing because now we can have two different things that produce the same message digest. If we think this thing here hasn’t been altered, but it has been altered, and if we look at hashing to prove that it hasn’t been altered, we may have a false sense of integrity because the two different things can produce the same output or the same message digest, which is the collision.

I’ll give an example of collision. MD5 is a common hash algorithm that’s been broken, as they like to say, and there are examples of collisions with MD5s. MD5 uses 128 bits. SHA1 is another type of hash algorithm that uses 160 bits, and there’s quite a few other ones. SHA512 uses 512 bids. But basically the larger the message digest, the more bits, the less likely of a collision, based on simple math at math. With MD5, there’s 128 bits. There’s a lot of discussion about collisions, but the bottom line is collisions are highly unlikely, even with MD5. It’s very unlikely somebody can intelligently alter some bit of data and generate the same message digest as some other signed piece of software, for instance.

Let’s just look at the math for this. I’m going to bring up a calculator in scientific mode and with 128 bits, that’s 2 to the 128th power, because we have a zero or one, which the 2, 128 bits. Those are the number of combinations with 128 bits. Let’s just look at that. So 2 to the 128, equals 3.4 whatever. Basically, that’s a really large number, so the probability of creating a collision is very small. It is possible though. This probability is going to be smaller with SHA512, there’s more bits.

The example that this one guy has figured out with MD5, is he’s had three images that produce the same message digest. What this guy did, if you actually look at the images in a hex editor, is he basically took one base image, which had a specific message digest, then he took another image and altered a bit at a time in the header of the image and kept running it through the hash algorithm until it basically produced the same messages digested. Then he stopped altering the bits, and the images look different because we’re not looking at the metadata, we’re just looking at the actual image data, visually. Then he did it again with a third image. This is the proof that there are collisions with MD5.

et’s look at this here. We’re going to use this tool called HashCalc to do the calculation. HashCalc is one of my favorite hashing calculators. If we just searched for White, Black and Brown MD5, we should find what this guy did. Three-way MD5 collision, Nat McHugh’s the dude’s name. Basically, we have these three pictures, Jack Black, James Brown, Barry White, I believe. I’m going to save each of these. I’m going to save this one as Jack Black. This one is James Brown. This one is Barry White. Then we’ll run these three through the hash calculator. And you can see, I’ll go ahead and open the images. If we look at the images, this one is obviously different than this one, and different than this one right here. The three images are different. The three data sets are different, but they produce the same MD5 message digest.

Let’s check this out. Here’s HashCalc. I’ll put the link to HashCalc in the video. With HashCalc you can just drag and drop the image. Here is Black, I’m going to drag it over here. Before I do that, I’ve got MD5 selected and I also have SHA1 selected. We’ll do the message digest for both of those algorithms. So there’s Black, here’s the message digest. I’ll go ahead and copy this. I’ll put this here in our notepad. Black equals this. That’s the MD5, and the SHA1 though, is this, for Black. That’s the SHA1. Let’s try White. White equals… Let’s take White and we’re going to drag the White image to HashCalc. So the MD5 for White is this. That’s MD5. The SHA1 is this. And with Brown, let’s try that one. You kind of know where this is all headed, right? With Brown, we expect the MD5 to be the same. We’ll drag it over here, I’ll go ahead and copy that. Brown MD5, it looks like it’s the same. The SHA1 should be different.

All right. With Jack Black, MD5 right here, Barry White right here, and James Brown, same MD5, three different inputs, that’s a collision. You notice the SHA1 is different for all three though. It is extremely, extremely unlikely you can create a collision that works with all hash algorithms. I’ve never seen that done. I doubt anyone can do it. If you want to check for collisions, just simply use two different hash algorithms.

I was going to talk about the birthday attack. That’s the other thing I have in the list here. The birthday attack is simply this mathematical probability. They like to talk about it a lot in certifications like the CISPD and Security Plus. All it really is, is probability. The whole idea is if given enough sample size and with birthday is we mean, if we have more than 23 people in a room, or 23 or more, the likelihood of two people sharing the same birthday, not the same birth year, but the same birthday, is over 50%. That’s because we’re not comparing everyone to you. We’re compared to everyone to each other. It’s just simple probability.

They like to use this concept to say how easy it is to create a collision with hashing. It’s not really that easy. It might be relatively easy for MD%, but still like the Jack Black, James Brown and Barry White, those are simply images where the header or metadata of the image was altered. To be able to intelligently alter something, to make a malicious file look like a signed piece of software, in my opinion, it’s not going to happen, regardless of what people say about this birthday attack and how easy this is, it’s really not that simple.

That’s all I wanted to talk about today, what hashing is, one-way function. You can’t take the message digest and go backwards and recreate the piece of data used to generate the message digest. We talked about collisions. There are collisions with MD5. It’s been broken per se, so has SHA1. But if you use two different hash algorithms, even if you have a collision on one, you could easily tell the data is different. We went over an actual example with collisions. We also went over HashCalc. I’ll put the link to HashCalc as well as a link to Nat McHugh’s site where you can download the Barry White, Jack Black and James Brown images and test out the collisions yourself.

If you have any questions or comments, you can lead them beneath the video. Please subscribe to our channel. Click on the little bell so you get notified when we have new videos. Thanks for watching. Have a great one.