I think the image encoder from CLIP (even the smallest variant, ViT-B/32) is good enough to capture a lot of semantic information to allow natural-language queries once images are indexed. A lot of work actually...
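As a minimal sketch of that idea (the excerpt doesn't show the actual indexing pipeline, so this assumes the open `openai/clip-vit-base-patch32` checkpoint via Hugging Face `transformers`; the `embed_images` and `search` helper names are hypothetical): embed each image once at index time, embed the text query at search time, and rank by cosine similarity.

```python
# Sketch: natural-language image search with CLIP ViT-B/32.
# Assumes the Hugging Face `transformers` checkpoint openai/clip-vit-base-patch32;
# the post's real pipeline may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Index step: one L2-normalized embedding per image."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(query, image_feats, paths, top_k=5):
    """Query step: embed the text, rank images by cosine similarity."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    sims = (image_feats @ q.T).squeeze(1)          # cosine, since both sides are unit norm
    best = sims.topk(min(top_k, len(paths)))
    return [(paths[i], sims[i].item()) for i in best.indices]

# Hypothetical usage:
# feats = embed_images(["cat.jpg", "beach.jpg"])
# print(search("a cat sleeping on a sofa", feats, ["cat.jpg", "beach.jpg"]))
```

Since the image embeddings are computed once and only the query side runs per search, this scales to a large library; at that point the in-memory matrix product would typically be swapped for a vector index (pgvector, FAISS, or similar).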
Similar Articles (10 found)
71.4% similar
https://medium.com/@mustafaakin/indexing-icloud-photos-with-ai-using-llava-and-pgvector-fd58182febf6
Indexing iCloud Photos with AI Using LLaVA and pgvector
A straightforward idea, gluing stuff together until it works, but it's a glimpse of what's pos...
67.7% similar
Things we learned about LLMs in 2024
31st December 2024
A lot has happened in the world of Large Language Models over the course of 2024. Here's a rev...
67.4% similar
The Rise of Multimodal LLMs and Efficient Serving with vLLM
In this tutorial, you will learn how multimodal LLMs like LLaVA, GPT-4V, and BakLLaVA comb...
66.8% similar
Table of Contents
- Meet BLIP: The Vision-Language Model Powering Image Captioning
- What Is Image Captioning and Why Is It Challenging?
- Configuring...
65.3% similar
Table of Contents
- SmolVLM to SmolVLM2: Compact Models for Multi-Image VQA
- SmolVLM 1: A Compact Yet Capable Vision-Language Model
- What Is SmolVLM...
64.5% similar
> the generation of 281,128 augmented examples, from which 1,000 were held out as a benchmark test set.
This model is trained on a custom dataset of 2...
63.5% similar
Multi-modal ML with OpenAI's CLIP
Language models (LMs) can not rely on language alone. That is the idea behind the "Experience Grounds Language" pape...
63.4% similar
Veo 3 shows emergent zero-shot abilities across many visual tasks, indicating that video models are on a path to becoming vision foundation models, jus...
62.6% similar
A HN user asked me how I run LLMs locally, with some specific questions; I'm documenting it here for everyone.
Before I begin I would like to credit t...
61.7% similar
Thanks for writing this one, Simon. I read it some time ago and just wanted to say thanks and recommend it to folks browsing the comments; it's reall...