Model Inversion

A model inversion attack is a privacy threat in which an adversary exploits a deployed model's behavior to reconstruct features of its training data: faces from face-recognition models, text passages from embedding vectors, medical records from clinical-prediction models. Fredrikson et al. (2015) demonstrated the attack by reconstructing faces from a face-recognition API, and it has since been demonstrated across many model types and modalities.

For LLMs, model inversion takes three specific forms. Embedding inversion: Morris et al. (2023) showed that sentence embeddings, including those from major commercial embedding models, can be inverted by a trained inversion model, recovering short inputs exactly in over 90% of cases in their reported experiments. Training data extraction: Carlini et al. demonstrated verbatim extraction of phone numbers, email addresses, code, and copyrighted text from GPT-2 and later models. Prompt extraction: an attacker recovers the system prompt of a deployed application through crafted queries. The implications are significant for RAG systems, where the vector database holds embeddings of proprietary content that were historically treated as "just numbers" but turn out to be partially invertible.

Defenses map to each attack. For embedding inversion, protect embedding databases at the same security tier as the source documents (encryption at rest, access control, audit logging), avoid exposing raw embeddings to untrusted clients, and prefer embedding models trained with privacy-aware methods such as differential privacy. For training-data extraction, deduplicate aggressively: Carlini et al. showed that the more times a string appears in training data, the easier it is to extract. For prompt extraction, use system-prompt protection techniques and detect extraction patterns in incoming queries. AI governance teams classify embedding databases as sensitive data at the source-document tier, not below it, because the inversion threat means the embeddings carry the same content risk as the original passages.
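The deduplication defense mentioned above can be illustrated with a minimal sketch. It assumes the training corpus is a list of text records and treats two records as duplicates when they match after lowercasing and whitespace collapsing. The function names (`normalize`, `fingerprint`, `deduplicate`) and the sample corpus are hypothetical and are not taken from any cited paper or product. Exact-match hashing is only a starting point: near-duplicates that differ by a few words would need a fuzzier method.

```python
import hashlib
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    """Stable SHA-256 fingerprint of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first occurrence of each record.

    Returns (unique_records, repeated), where `repeated` maps the first-seen
    form of every duplicated record to the number of times it appeared.
    """
    first_seen = {}
    counts = Counter()
    unique = []
    for record in records:
        key = fingerprint(record)
        counts[key] += 1
        if key not in first_seen:
            first_seen[key] = record
            unique.append(record)
    repeated = {first_seen[k]: n for k, n in counts.items() if n > 1}
    return unique, repeated


# Hypothetical corpus: one string appears three times in slightly different forms.
corpus = [
    "Contact support at 555-0100.",
    "contact   support at 555-0100.",
    "Quarterly results were strong.",
    "Contact support at 555-0100.",
]

unique, repeated = deduplicate(corpus)
print(len(unique), "unique records")
print(repeated)
```

The `repeated` report is useful on its own: strings with high repeat counts are the ones the extraction results suggest are most at risk, so they are worth reviewing for sensitive content before training.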

Embedding security at the source tier, from 25 years of sensitivity classification: Centralpoint classifies and protects embeddings at the same sensitivity tier as the underlying content, a discipline Oxcyon has applied to all derived artifacts (search indexes, summaries, taxonomies) for 25 years. Embeddings stay on-premises, token usage is metered per skill, and inversion-resistant chatbots deploy through one line of JavaScript.

Related Keywords:
Model Inversion, Oxcyon, AI, AI Governance, Generative AI, Inference, Inferencing, RAG, Prompts, Skills Manager
