# Similar Image Search (2021)

Site: blog.qwertyforce.dev

Image retrieval system for my self-hosted photo gallery engine.

One day I decided to build a photo gallery. I realized pretty quickly that a "dumb" photo gallery is not a very interesting project, so I began to think about how to make it "smarter". Search is a really cool idea, so I began my research and fell down the rabbit hole of image retrieval. I've created a set of microservices which can be used for reverse image search/similar image search, or for tasks like tagging and image captioning. I tried to keep it simple, so that people could understand how it works just by looking at the code and easily make any changes they want. (Well, I looked at the code again; sometimes it's super confusing. I hope it's readable enough to modify.)

Current microservices: phash_web, color_web, local_features_web, global_features_web, image_text_features_web.

In the text below, I will try to briefly explain how they work and show search examples, runnable examples (Google Colab), links for further reading, and various tips/tricks. If you want to read about the photo gallery, click here.

Tip: all microservices use Pillow for image resizing, so you can use Pillow-SIMD as a drop-in replacement for Pillow. Benchmarks show that for resizing, Pillow is always faster than ImageMagick, and Pillow-SIMD, in turn, is even faster than the original Pillow by a factor of 4-6.

> In general, Pillow-SIMD with AVX2 is always 16 to 40 times faster than ImageMagick and outperforms Skia, the high-speed graphics library used in Chromium. (https://github.com/uploadcare/pillow-simd)

Web framework: FastAPI, with Uvicorn as the ASGI server. This is a very convenient combination for this kind of project, because FastAPI is light and easy to use, and the built-in Swagger UI makes development more convenient.

phash_web, color_web, local_features_web, global_features_web, and image_text_features_web use LMDB to store features on disk and FAISS for constructing a search index. Previously I used SQLite for storing numpy arrays, but it turned out to be extremely slow because of the cost of serializing numpy arrays to bytes. LMDB, on the other hand, can write/read raw bytes from a memory-mapped file; it can even bypass unnecessary copying by the kernel.

> Since LMDB is memory mapped it is possible to access record data without keys or values ever being copied by the kernel, database library, or application. To exploit this the library can be instructed to return buffer() objects instead of byte strings by passing buffers=True. (https://lmdb.readthedocs.io/en/release/#buffers)
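For illustration, here is a minimal sketch of what storing a feature vector in LMDB as raw bytes can look like. The database path, map size, and key scheme are made up for this example; the actual microservices organize their databases differently:

```python
import lmdb
import numpy as np

env = lmdb.open("./features.lmdb", map_size=10 * 1024**3)  # hypothetical path and 10 GB map size

features = np.random.rand(512).astype(np.float32)  # stand-in for a real feature vector
key = (42).to_bytes(4, "big")                      # image id 42 as a 4-byte key

with env.begin(write=True) as txn:
    txn.put(key, features.tobytes())  # raw bytes go straight into the memory-mapped file

with env.begin(buffers=True) as txn:  # buffers=True returns zero-copy buffer objects
    buf = txn.get(key)
    restored = np.frombuffer(buf, dtype=np.float32)  # a view over the buffer, valid while the transaction is open
```

Since np.frombuffer only wraps the underlying memory, "deserialization" is essentially free compared to parsing bytes back into an array, which is where SQLite-based storage was losing time.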
## phash_web [Colab]

Supported operations: add, delete, get similar by image id, get similar by image.

> Perceptual image hash functions produce hash values based on the image's visual appearance. A perceptual hash can also be referred to as e.g. a robust hash or a fingerprint. Such a function calculates similar hash values for similar images, whereas for dissimilar images dissimilar hash values are calculated. Finally, using an adequate distance or similarity function to compare two perceptual hash values, it can be decided whether two images are perceptually different or not. Perceptual image hash functions can be used e.g. for the identification or integrity verification of images. (https://www.phash.org/docs/pubs/thesis_zauner.pdf, page 12)

There are a lot of perceptual hashing algorithms, such as aHash, dHash, wHash, and pHash (you can check their implementations in the Python library ImageHash). Usually, Hamming distance is used for comparing these hashes: the smaller the distance, the higher the probability that the images are the same (false positives can still occur).

PDQ Hash (https://github.com/faustomorales/pdqhash-python) is a hashing algorithm designed by Facebook, inspired by pHash, with some optimizations such as a bigger hash size, usage of luminance instead of greyscale, and others. As the PDQ paper puts it, there are syntactic and semantic methods of hashing images:

> TMK+PDQF and PDQ are syntactic rather than semantic hashers. Algorithms in the latter category detect features within images, e.g. determining that a given image is a picture of a tree. Such algorithms are powerful: they can detect different photos of the same individual, for example, by identifying facial features. The prices paid are model-training time and a-priori selection of the feature set to be recognized. For copy-detection use cases, by contrast, we simply want to see if two images are essentially the same, having available neither prior information about the images, nor their context. (https://github.com/facebook/ThreatExchange/blob/main/hashing/hashing.pdf)

Although PDQ is potentially better than pHash, I used DCT-based pHash with a 576-bit hash size, because it seems that PDQ is less sensitive than pHash: on the image below, the Hamming distance between these images is 22 for PDQ and 110 for pHash.

I rewrote the phash code from the ImageHash library (https://github.com/JohannesBuchner/imagehash/blob/master/imagehash.py#L197), which implements this article: https://hackerfactor.com/blog/index.php%3F/archives/432-Looks-Like-It.html. Let's go step by step, with code examples:

- Step 1: Reduce color
- Step 2: Reduce size
- Step 3: Compute the DCT
- Step 4: Reduce the DCT
- Step 5: Compute the average value
- Step 6: Further reduce the DCT (make binary)
- Step 7: Construct the hash

```python
import io
import numpy as np
from PIL import Image
from numba import jit
from scipy.fftpack import dct

@jit(cache=True, nopython=True)
def bit_list_to_72_uint8(bit_list_576):
    uint8_arr = []
    for i in range(len(bit_list_576) // 8):
        # convert our list of booleans into a list of uint8 numbers (boolean -> bit -> number)
        bit_list = []
        for j in range(8):
            if bit_list_576[i * 8 + j] == True:
                bit_list.append(1)
            else:
                bit_list.append(0)
        # convert the list of 1's and 0's to a number and append it to the array
        uint8_arr.append(bit_list_to_int(bit_list))
    return np.array(uint8_arr, dtype=np.uint8)

@jit(cache=True, nopython=True)
def bit_list_to_int(bitlist):
    out = 0
    for bit in bitlist:
        out = (out << 1) | bit
    return out

@jit(cache=True, nopython=True)
def diff(dct_data, hash_size):
    dctlowfreq = dct_data[:hash_size, :hash_size]  # Step 4: reduce the DCT
    med = np.median(dctlowfreq)  # Step 5: compute the "average" (here: median) value
    # Step 6: further reduce the DCT (make binary). This produces an array of booleans:
    # an element becomes True if it is higher than the median, otherwise False.
    diff = dctlowfreq > med
    return diff.flatten()

def fast_phash(resized_image, hash_size):
    dct_data = dct(dct(resized_image, axis=0), axis=1)  # Step 3: compute the DCT
    return diff(dct_data, hash_size)

def read_img_buffer(image_buffer):
    img = Image.open(io.BytesIO(image_buffer))
    if img.mode != 'L':
        img = img.convert('L')  # Step 1: convert to greyscale
    return img

def get_phash(image_buffer, hash_size=24, highfreq_factor=4):  # ENTRY POINT
    img_size = hash_size * highfreq_factor
    query_image = read_img_buffer(image_buffer)  # <- Step 1
    query_image = query_image.resize((img_size, img_size), Image.Resampling.LANCZOS)  # Step 2: reduce size
    query_image = np.array(query_image)
    bit_list_576 = fast_phash(query_image, hash_size)  # <- Steps 3, 4, 5, 6
    phash = bit_list_to_72_uint8(bit_list_576)  # <- Step 7: construct the hash
    return phash
```
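Comparing two of these 72-byte hashes is just a Hamming distance over 576 bits. A minimal sketch of such a comparison (the file names and the helper function are my own illustration, not part of the microservice's API):

```python
import numpy as np

def hamming_distance(hash_a, hash_b):
    # XOR the two 72-byte hashes and count the bits that differ
    return int(np.count_nonzero(np.unpackbits(np.bitwise_xor(hash_a, hash_b))))

with open("cat.jpg", "rb") as f:            # hypothetical file names
    hash_a = get_phash(f.read())
with open("cat_thumbnail.jpg", "rb") as f:
    hash_b = get_phash(f.read())

print(hamming_distance(hash_a, hash_b))     # small distance -> probably the same image
```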
In the code above we used the @jit decorator provided by Numba. Everyone knows that Python is a slow language, so in order to speed up our computations we want to offload as many CPU-bound operations as possible to C libraries like numpy, scipy, numba, etc. Numba is a just-in-time compiler which runs code outside of the Python interpreter, thus making it significantly faster.

To exploit the fact that most modern systems have more than one core, we can use multiprocessing libraries like joblib. Joblib allows us to use more than one CPU core with just a single line:

```python
from joblib import Parallel, delayed

phashes = Parallel(n_jobs=-1, verbose=1)(delayed(calc_phash)(file_name) for file_name in batch)
```

pHash is robust to minor transformations such as JPEG compression artifacts, minor blur/noise, and the same image at a lower/higher resolution (for example, the original image and its thumbnail). Example: the Hamming distance between these two images is 4. From my observations, similar images have a distance <= 64.

Pros:

- the hash has a small size (576 bits or 72 bytes, i.e. 72 MB for 1 million images)
- it can be calculated very quickly
- search is fast
- with an optimal threshold value, there are not a lot of false positives

Cons:

- unstable to geometric transformations (for example, cropping, mirroring, rotation): they give a totally different hash

To make the search more resilient to these transformations, at search time we can search not only with the phash of the original image but also with phashes of modified versions of the image (mirrored or rotated). The microservice uses mirroring of the original image to address this problem.

## color_web [Colab]

Supported operations: add, delete, get similar by image id, get similar by image.

Let's compare the RGB histograms (color distributions) of two images:

(Table from the original post: each image shown next to its 256-bin and 8-bin RGB histograms.)

You can see that images 1 and 2 have a similar color palette, and their RGB histograms are also similar! To compare histograms of images with different sizes, we must flatten the NxNxN matrix (N is the number of bins) into a vector and then normalize it by dividing each bin by the total number of pixels in the image. We will also use 8-bin histograms, because they are much smaller: a 256-bin histogram of float32 values takes 256^3 * 4 bytes ≈ 67 MB, while an 8-bin histogram takes just 8^3 * 4 bytes ≈ 2 KB.

```python
import cv2
import numpy as np

def get_features(query_image):
    query_hist_combined = cv2.calcHist([query_image], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    query_hist_combined = query_hist_combined.flatten()
    # the original snippet is cut off here; normalizing by the total number of pixels,
    # as described above, would look like this:
    query_hist_combined = query_hist_combined / (query_image.shape[0] * query_image.shape[1])
    return query_hist_combined.astype(np.float32)
```
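These flattened, normalized histograms are exactly the kind of fixed-size vectors that can go into a FAISS index, which the services use for search. A minimal sketch with stand-in data (the flat L2 index and the random vectors are my assumptions for illustration; the actual service may use a different index type or metric):

```python
import faiss
import numpy as np

dim = 8 * 8 * 8  # a flattened 8x8x8 RGB histogram is a 512-dimensional vector
index = faiss.IndexIDMap(faiss.IndexFlatL2(dim))

# stand-in data: in the real pipeline these would be get_features() outputs and image ids
hists = np.random.rand(1000, dim).astype(np.float32)
ids = np.arange(1000, dtype=np.int64)
index.add_with_ids(hists, ids)

query = hists[0:1]                                 # query with the first histogram
distances, neighbor_ids = index.search(query, 10)  # 10 nearest histograms by L2 distance
print(neighbor_ids[0])
```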