Run CLIP on iPhone to search photos

I built an app called Queryable, which integrates the CLIP model on iOS to search the Photos album OFFLINE. It is available on the App Store today. I thought it might be helpful to others who are as frustrated with the search function of Photos as I was, so I wrote this article to introduce it.

CLIP

CLIP (Contrastive Language-Image Pre-Training) is a model proposed by OpenAI in 2021. CLIP can encode images and text into representations that can be compared in the same space. CLIP is a building block for many text-to-image systems: Stable Diffusion, for example, uses a CLIP text encoder to condition generation on the prompt, and CLIP-guided methods use it to measure the distance between the generated image and the prompt.

To run on iOS devices in real time, I made a compromise between performance and model size, finally chose the ViT-B-32 model, and separated it into a Text Encoder and an Image Encoder.

In ViT-B-32, the Text Encoder encodes any text into a 1x512 vector, and the Image Encoder encodes any image into a 1x512 vector.

We can calculate the proximity of a sentence and an image by finding the cosine similarity between the text vector and the image vector. In Python, using OpenAI's clip package, this looks roughly as follows:

    import torch
    import clip
    from PIL import Image
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(Image.open("photo-of-a-dog.png")).unsqueeze(0).to(device)  # resized to 224x224
    text = clip.tokenize(["rainy night"]).to(device)  # text must be tokenized first
    image_feature = model.encode_image(image)  # 1x512
    text_feature = model.encode_text(text)  # 1x512
    sim = torch.cosine_similarity(image_feature, text_feature)

Integrate CLIP into iOS

I exported the Text Encoder and Image Encoder to Core ML models using the coremltools library.
The final models have a total file size of 300MB. Then, I started writing Swift.

Here is a simplified view of inference with the Text Encoder in Swift:

    let text_encoder = try MLModel(contentsOf: TextEncoderURL, configuration: config)
    // `encode` is shorthand: a real call tokenizes the text and goes through MLModel's prediction(from:).
    let text_feature = text_encoder.encode("a dog")

I split the Text Encoder and Image Encoder into two models because, when you actually use this Photos search app, your input text changes with every search, but the content of the Photos library is fixed. This means all the image vectors can be computed once and saved in advance, and only one text vector has to be computed per search. Real-time text search over a Photos library of tens of thousands of images thus becomes possible.

[Figure: flowchart of how Queryable works]

Performance

Compared to the built-in search of iPhone Photos, how much does CLIP-based album search improve things? The answer is: overwhelmingly. With CLIP, you can search for a scene in your mind, a tone, an object, or even an emotion conveyed by the image.

To use Queryable, you first need to build the index: the app traverses your album, calculates all the image vectors, and stores them. This takes place only ONCE. The total time required depends on the number of photos; the speed is about 2,000 photos per minute on an iPhone 12 mini. When you have new photos, you can manually update the index, which is very fast.

The time cost of a search also depends on the number of photos. For fewer than 10,000 photos it takes less than 1s. For me, an iPhone 12 mini user with 35,000 photos, each search takes about 2.8s.

I made a video to demonstrate the search capabilities of Queryable:

[Video: demonstration of Queryable's search capabilities]

Q&A

1. On privacy and security issues.
Queryable is designed as an OFFLINE app that does not require a network connection and will NEVER request network access, thereby avoiding privacy issues.

2. What if my pictures are stored in iCloud?
Because it cannot connect to a network, Queryable can only use the locally cached low-definition versions of the photos in your album. However, the CLIP model itself resizes the input image to a very small size (e.g. 224x224 for ViT-B-32), so storing your images in iCloud does not actually affect search accuracy. The only drawback is that you cannot view the original image in the search results.

3. Any requirements for the device?
iOS 16.0 or above
iPhone 11 (A13 chip) or later models

4. Have some suggestions or product experience issues?
Feel free to contact me by email: myfancoo@gmail dot com.
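
The index-once, search-per-query design described in this article can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not Queryable's actual code: the function names (`normalize`, `build_index`, `search`) and the photo names are hypothetical, and toy 3-dimensional vectors stand in for the 512-dimensional CLIP embeddings.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length, so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(image_vectors: Dict[str, Vector]) -> Dict[str, Vector]:
    """Done once: store a unit-length vector for every photo."""
    return {photo_id: normalize(vec) for photo_id, vec in image_vectors.items()}


def search(index: Dict[str, Vector], text_vector: Vector,
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Done per query: rank all photos by cosine similarity to the text vector."""
    query = normalize(text_vector)
    scored = [
        (photo_id, sum(a * b for a, b in zip(vec, query)))
        for photo_id, vec in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones would come from the Image Encoder.
toy_image_vectors = {
    "dog.png": [0.9, 0.1, 0.0],
    "night_street.png": [0.1, 0.9, 0.2],
    "beach.png": [0.0, 0.2, 0.9],
}

index = build_index(toy_image_vectors)          # computed once, then reused
results = search(index, [0.1, 1.0, 0.4], top_k=2)  # one text vector per search
print(results)
```

With these toy values, `night_street.png` ranks first and `beach.png` second. Because every search scans the whole index, the cost of a query grows linearly with the number of photos, which is consistent with the search times reported above.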