Demo of Encoder-Free VLM Trained for $100
Play with this encoder-free vision-language model, inspired by the architecture of Gemma 4 12B Unified. Our model was trained for about $100 (43 hours on a single H100). It used Qwen 3 1.7B as a decoder and a subset of FineVision as training data.
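As a rough sanity check on the budget, the figures above imply an average GPU rental price. The sketch below is only arithmetic on the two quoted numbers. It assumes the whole $100 went to GPU time, which the text does not state, and the function name is hypothetical.

```python
def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Return the average cost per GPU-hour implied by a total budget.

    Assumption: the entire budget is GPU rental (no storage, data, or
    evaluation costs), which the demo description does not confirm.
    """
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return total_cost_usd / gpu_hours


# Figures quoted in the description: about $100 for 43 hours on one H100.
rate = implied_hourly_rate(100.0, 43.0)
print(f"${rate:.2f} per GPU-hour")  # prints "$2.33 per GPU-hour"
```

Because the cost is quoted as "about $100", the result is an estimate, not an exact price.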
To get started, upload an image and enter a text prompt, or try one of the examples. The demo does not keep chat history, so the model treats every message as a new conversation.
Read more about how this model was trained in our blog post: Train Your Own Encoder-Free VLM in $100.