What are Similarity Digests?

Wednesday, April 9, 2014

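With a topic this dry, of course I need a fun picture, ~~so I can cover up my ignorance (oops)~~. From the picture you might guess that this post is about celebrity look-alikes (oops again). The actual topic this time is Similarity Digests (in Chinese, "相似度領悟", roughly "similarity comprehension"?).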

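A few days ago I got a notice for a training session titled "Similarity Digests: Hashes for data mining and big data". The moment I saw it was about data mining and big data I signed up (conveniently ignoring the "Similarity Digests" part, which I did not understand)... Little did I know this was where the pain began, because the content and Big Data... well, at least it was not quite what I had imagined.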

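Traditional file comparison usually relies on MD5, SHA1, and the like. Those hashes only tell you whether two files are identical: change even a tiny bit of the content and the two hash values come out completely different. But what if we want to know how similar two files are? That is where Similarity Digests come in. Their main purpose is to measure how similar two files are, whether they are executables, images (uncompressed formats), text files, and so on, although they naturally have their own limitations and scope of applicability.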
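To see the "change a little, get a totally different hash" behavior for yourself, here is a small sketch using Python's standard `hashlib`. The two sample byte strings and the helper names `hex_digest` and `differing_hex_digits` are made up for this illustration.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"  # a single byte changed


def hex_digest(algorithm: str, data: bytes) -> str:
    """Return the hex digest of data under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(x != y for x, y in zip(a, b))


for algorithm in ("md5", "sha1"):
    h1 = hex_digest(algorithm, original)
    h2 = hex_digest(algorithm, modified)
    print(algorithm, h1, h2)
    print("  differing hex digits:", differing_hex_digits(h1, h2), "of", len(h1))
```

The digests tell you the inputs are not identical, but nothing about how close they are, which is exactly the gap similarity digests try to fill.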

The better-known similarity digests at the moment are:

1. sdhash

sdhash is a tool that allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. It is designed to provide quick results during the triage and initial investigation phases. It has been in active development since 2010 with the explicit goal of becoming fast, scalable, and reliable.

2. tlsh (developed by Trend Micro)

TLSH is a fuzzy matching library. Given a binary object, it generates a hash value. The hash values can be used for similarity comparison. Similar objects have similar hash values. Similar hash values signal similar objects.

3. nilsimsa [wiki] [github]

Nilsimsa is a distance based hash, which is the opposite of more familiar hashes like MD5. Instead of small changes making a large difference in the resulting hash (to avoid collisions), distance based hashes cause similar values to have similar output. This is good for detecting near similar documents without having to store the original text.

4. ssdeep

ssdeep is a program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length.
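To make the "context triggered piecewise" idea a bit more concrete, here is a toy Python sketch. It is not ssdeep's actual algorithm: the window size, the trigger rule (sum of the last few bytes), the use of MD5 for each piece, the difflib-based score, and the names `toy_ctph` and `toy_score` are all assumptions made up for illustration. The point is only that piece boundaries are chosen by local content, so an edit disturbs a few signature characters instead of the whole digest.

```python
import hashlib
from difflib import SequenceMatcher

WINDOW = 7     # bytes of context that decide a boundary (assumed value)
TRIGGER = 16   # on random data a boundary fires about once per 16 bytes (assumed value)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def _piece_char(piece: bytes) -> str:
    # One signature character per piece, derived from the piece's MD5 digest.
    return ALPHABET[hashlib.md5(piece).digest()[0] % len(ALPHABET)]


def toy_ctph(data: bytes) -> str:
    """Split data at content-defined boundaries and hash each piece to one character."""
    signature = []
    piece_start = 0
    for i in range(len(data)):
        # The trigger looks only at the last WINDOW bytes (the "context"),
        # so boundaries line up again shortly after an insertion or deletion.
        # The window sum is recomputed each step for clarity, not speed.
        window = data[max(0, i - WINDOW + 1): i + 1]
        if sum(window) % TRIGGER == TRIGGER - 1:
            signature.append(_piece_char(data[piece_start:i + 1]))
            piece_start = i + 1
    if piece_start < len(data):
        signature.append(_piece_char(data[piece_start:]))  # trailing piece
    return "".join(signature)


def toy_score(sig_a: str, sig_b: str) -> int:
    """Score two signatures from 0 to 100; higher means more similar."""
    return round(100 * SequenceMatcher(None, sig_a, sig_b).ratio())
```

With this sketch, `toy_score(toy_ctph(a), toy_ctph(b))` should come out high for a file and a lightly edited copy, and low for two unrelated files, which is the behavior the MD5/SHA1 family cannot give you.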

Conclusion:

1. How to read the scores

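tlsh: the score starts at 0 and goes up (0->100 as presented in the training; TLSH distances can actually exceed 100); the larger the value, the less similar.
ssdeep: the score runs 100->0; the smaller the value, the less similar.
sdhash: the score runs 100->0; the smaller the value, the less similar.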
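Because the tools point in opposite directions, it is easy to sort results the wrong way round. A minimal sketch of a helper that ranks candidates from most to least similar follows. The table `HIGHER_MEANS_MORE_SIMILAR`, the function `rank_by_similarity`, and the sample scores are hypothetical names and numbers for illustration, not part of any of these tools.

```python
# Direction of each tool's score, as summarized in the list above.
HIGHER_MEANS_MORE_SIMILAR = {
    "tlsh": False,   # distance: 0 is the closest match
    "ssdeep": True,  # match score: 100 is the closest match
    "sdhash": True,  # similarity score: 100 is the closest match
}


def rank_by_similarity(tool: str, scores: dict) -> list:
    """Return (name, score) pairs ordered from most to least similar."""
    descending = HIGHER_MEANS_MORE_SIMILAR[tool]
    return sorted(scores.items(), key=lambda item: item[1], reverse=descending)


# Made-up scores for three candidate files compared against one sample.
print(rank_by_similarity("tlsh", {"a.bin": 40, "b.bin": 5, "c.bin": 90}))
print(rank_by_similarity("ssdeep", {"a.bin": 40, "b.bin": 5, "c.bin": 90}))
```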

2. Which one to choose?

Prefer sdhash and tlsh where you can, and stop using ssdeep.

why? LDJ4....XD

If you are interested, see the following documents:

[1] Security and Implementation Analysis of the Similarity Digest sdhash
[2] Data Fingerprinting with Similarity Digests
[3] Similarity Comparison with SDHASH (fuzzy hashing)