CLIP

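CLIP, short for Contrastive Language-Image Pre-training, is the dual-encoder model introduced by OpenAI in 2021 (Radford et al.) that learns a shared embedding space for images and text. The original model was trained on 400 million image-caption pairs collected from the web. Its two encoders, an image encoder (a ResNet or a Vision Transformer, depending on the variant) and a text transformer, are trained jointly with a contrastive loss: within each batch, matching image-caption pairs are pulled together in embedding space while non-matching pairs are pushed apart. The result is a model that can score the similarity between any image and any text string. That enables zero-shot image classification (no fine-tuning needed; just supply candidate class names as text), image search by natural-language query, and a visual grounding layer that many multimodal systems build on.

CLIP variants are now widely used: OpenCLIP (an open-source reimplementation with LAION-trained checkpoints at multiple training scales), SigLIP (Google's variant that replaces the softmax-based contrastive loss with a pairwise sigmoid loss), EVA-CLIP, Chinese CLIP, FashionCLIP (domain-tuned for fashion), and BiomedCLIP (biomedical). A CLIP text encoder provides Stable Diffusion's text conditioning, CLIP-style vision encoders serve as the image side of open multimodal models such as LLaVA, and CLIP embeddings are common in visual RAG systems. The vision encoders of proprietary models such as GPT-4V and Claude 3.5 Sonnet have not been publicly documented.

A practical recipe with OpenCLIP: pip install open_clip_torch; import open_clip; model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k'); tokenizer = open_clip.get_tokenizer('ViT-B-32'); image_features = model.encode_image(preprocess(image).unsqueeze(0)); text_features = model.encode_text(tokenizer(['a photo of a cat', 'a photo of a dog'])); image_features /= image_features.norm(dim=-1, keepdim=True); text_features /= text_features.norm(dim=-1, keepdim=True); similarities = (100.0 * image_features @ text_features.T).softmax(dim=-1). Here image is a PIL image, and the encoding calls should run under torch.no_grad(). Note that create_model_and_transforms returns three values (the model plus training and validation transforms) and that the tokenizer must be created explicitly. The normalization and scaling steps matter: CLIP is trained on scaled cosine similarities, so a softmax over raw dot products does not give meaningful probabilities.

AI governance teams scrutinize CLIP because of its training data. OpenAI did not release the dataset behind the original model, and the LAION-400M and LAION-5B datasets used for many open checkpoints are web-scraped and include copyrighted images, personal photos, and biased representations. Using CLIP downstream means inheriting those issues unless mitigations are applied.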
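To make the contrastive objective and the zero-shot scoring step described above concrete, here is a minimal sketch in plain Python that needs no model weights. The toy 3-dimensional vectors stand in for encoder outputs, and the function names (log_softmax, similarity_matrix, clip_contrastive_loss, zero_shot_probs) and the fixed scale of 100 are illustrative assumptions, not part of any CLIP library. In a trained model the scale is a learned temperature parameter.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def similarity_matrix(image_embs, text_embs, scale):
    """Scaled cosine similarity between every image and every text."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[scale * sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def clip_contrastive_loss(image_embs, text_embs, scale=100.0):
    """Symmetric cross-entropy; image i and text i are assumed to be a matching pair."""
    logits = similarity_matrix(image_embs, text_embs, scale)
    n = len(logits)
    # Image-to-text direction: row i should put its mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should put its mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


def zero_shot_probs(image_emb, class_text_embs, scale=100.0):
    """Probability of each candidate class text for a single image embedding."""
    row = similarity_matrix([image_emb], class_text_embs, scale)[0]
    return [math.exp(x) for x in log_softmax(row)]


# Toy embeddings (made up for illustration): images[i] matches texts[i].
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]]
texts = [[0.9, 0.2, 0.0], [0.1, 0.8, 0.0]]

aligned_loss = clip_contrastive_loss(images, texts)
shuffled_loss = clip_contrastive_loss(images, texts[::-1])
probs = zero_shot_probs(images[0], texts)
print(aligned_loss, shuffled_loss, probs)
```

With correctly paired rows the loss is close to zero, and it grows sharply when the captions are shuffled. That gap is the signal contrastive training uses to pull matching pairs together and push the rest apart.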

CLIP-powered visual search on 25 years of document ingestion: Centralpoint's document ingestion pipeline, developed over 25 years, already extracts images from PDFs, Office files, and web content for client CMS deployments. Adding CLIP embeddings to those images makes visual search a natural extension of the existing index. CLIP runs on-premise with open-weight checkpoints, token usage is metered per skill, and visual-search chatbots deploy through one line of JavaScript.
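As a rough illustration of how visual search over an existing image index could work once embeddings are stored, the sketch below ranks indexed images by cosine similarity to a query embedding. It does not reflect Centralpoint's actual API: the index layout, the image identifiers, the search function, and the toy 3-dimensional vectors are all hypothetical, and in practice the embeddings would come from a CLIP image encoder (for the index) and text encoder (for the query).

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_emb: List[float],
           index: Dict[str, List[float]],
           k: int = 3) -> List[Tuple[str, float]]:
    """Return the k indexed images most similar to the query, best first."""
    scored = [(image_id, cosine(query_emb, emb)) for image_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical index: image identifier -> stored image embedding (toy values).
index = {
    "brochure.pdf#img2": [0.9, 0.1, 0.0],
    "slides.pptx#img1": [0.1, 0.9, 0.2],
    "page.html#img5": [0.0, 0.2, 0.9],
}

# Stand-in for the text embedding of a natural-language query.
query = [0.8, 0.2, 0.1]

results = search(query, index, k=2)
for image_id, score in results:
    print(image_id, round(score, 3))
```

A linear scan like this is adequate for small collections. Larger indexes usually move to an approximate nearest-neighbor library or a vector database, which is a design choice outside the scope of this sketch.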

Related Keywords:
CLIP, Oxcyon, AI, AI Governance, Generative AI, Inference, Inferencing, RAG, Prompts, Skills Manager
