AI Data Privacy: PII Detection, Data Anonymization, Local Processing

Introduction

AI applications process vast amounts of data, much of it containing personally identifiable information (PII). Sending raw PII to LLM APIs creates compliance risks under GDPR, CCPA, and other regulations. This article covers practical techniques for detecting and redacting PII, anonymizing training data, and processing sensitive information locally.

PII Detection

Automated detection identifies sensitive data before it reaches an LLM API:

    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    # Initialize the Presidio engines. By default the analyzer uses spaCy's
    # en_core_web_lg model, which must be installed beforehand.
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    def detect_pii(text: str) -> list[dict]:
        """Return each detected entity with its type, span, score, and text."""
        results = analyzer.analyze(
            text=text,
            entities=[
                "PHONE_NUMBER", "EMAIL_ADDRESS",
                "CREDIT_CARD", "US_SSN", "PERSON",
                "LOCATION", "DATE_TIME", "NRP",
                "US_BANK_NUMBER", "IP_ADDRESS",
            ],
            language="en",
        )
        return [
            {"entity": r.entity_type, "start": r.start, "end": r.end,
             "score": r.score, "text": text[r.start:r.end]}
            for r in results
        ]

    def redact_pii(text: str) -> str:
        """Replace every detected entity with a placeholder such as <PERSON>."""
        analyzer_results = analyzer.analyze(text=text, language="en")
        return anonymizer.anonymize(text=text, analyzer_results=analyzer_results).text

Presidio combines pattern-based detection (regex for credit cards, SSNs, phone numbers) with NLP-based detection (spaCy for person names, locations, organizations). This dual approach catches both structured and unstructured PII.
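To make the pattern-based half concrete, here is a minimal sketch that uses only the standard library rather than Presidio. The patterns and the `find_structured_pii` helper are illustrative assumptions, not Presidio's actual recognizers. The Luhn checksum filters out digit runs that merely look like card numbers.

```python
import re

# Illustrative patterns only; production recognizers are considerably stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_structured_pii(text: str) -> list[dict]:
    """Hypothetical helper: find emails, SSN-shaped strings, and card numbers."""
    found = []
    for label, pattern in (("EMAIL_ADDRESS", EMAIL_RE), ("US_SSN", SSN_RE)):
        for m in pattern.finditer(text):
            found.append({"entity": label, "start": m.start(), "end": m.end()})
    for m in CARD_RE.finditer(text):
        # Keep a card-shaped match only if its checksum is valid.
        if luhn_valid(m.group()):
            found.append({"entity": "CREDIT_CARD", "start": m.start(), "end": m.end()})
    return sorted(found, key=lambda item: item["start"])
```

Regexes like these cannot find names or free-form addresses, which is why the NLP-based recognizers are needed alongside them.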

Data Anonymization

For training data or analytics, full removal may be too destructive. Anonymization preserves utility while protecting privacy:

    import hashlib

    import numpy as np
    from faker import Faker

    fake = Faker()

    class DataAnonymizer:
        def __init__(self):
            # (field, original value) -> replacement, so repeats stay consistent
            self.mapping_cache = {}

        def anonymize_record(self, record: dict, pii_fields: list[str]) -> dict:
            anonymized = record.copy()
            for field in pii_fields:
                if field in anonymized and anonymized[field]:
                    anonymized[field] = self._replace_value(field, anonymized[field])
            return anonymized

        def _replace_value(self, field: str, value: str) -> str:
            # Reuse an earlier replacement if this value was seen before
            key = (field, value)
            if key not in self.mapping_cache:
                self.mapping_cache[key] = self._generate(field, value)
            return self.mapping_cache[key]

        def _generate(self, field: str, value: str) -> str:
            if field == "email":
                return fake.email()
            elif field == "phone":
                return fake.phone_number()
            elif field == "name":
                return fake.name()
            elif field == "address":
                return fake.address()
            elif field == "ssn":
                return fake.ssn()
            else:
                # Tokenization: stable pseudonym via hashing
                hashed = hashlib.sha256(value.encode()).hexdigest()[:16]
                return f"USER_{hashed}"

    # Differential privacy: add calibrated noise
    def add_laplace_noise(true_value: float, epsilon: float = 1.0) -> float:
        """Add Laplace noise for differential privacy.

        Lower epsilon = more privacy, less accuracy."""
        scale = 1.0 / epsilon  # assumes the query has sensitivity 1
        noise = np.random.laplace(0, scale)
        return true_value + noise
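The function above fixes the noise scale at 1/epsilon, which is appropriate only for queries with sensitivity 1, such as a count. As a sketch of the general case, the Laplace mechanism draws noise with scale sensitivity/epsilon. The example below uses only the standard library; `laplace_mechanism`, `mean_abs_error`, and their parameters are illustrative names, not part of any library.

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two independent Exponential(1) draws is Laplace(0, 1).
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_value + noise


def mean_abs_error(epsilon, trials=20000, seed=0):
    """Empirical average |noise| for a sensitivity-1 query at the given epsilon."""
    rng = random.Random(seed)
    total = sum(abs(laplace_mechanism(0.0, 1.0, epsilon, rng)) for _ in range(trials))
    return total / trials
```

The expected absolute error of Laplace noise equals its scale, so for the same query epsilon = 0.1 produces roughly ten times the error of epsilon = 1.0.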

Anonymization Strategies

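The code above illustrates three strategies. Substitution swaps a value for realistic fake data generated by Faker, which keeps records usable for testing and demos. Tokenization replaces a value with a hash-derived pseudonym, so the same input always yields the same token and records can still be joined. Differential privacy adds calibrated Laplace noise to numeric values, trading accuracy for privacy through the epsilon parameter.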
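One caveat on hash-based tokenization: an unkeyed SHA-256 of a low-entropy value such as a phone number can be reversed by hashing every candidate value and comparing. A keyed hash (HMAC) avoids this as long as the key stays secret. The sketch below assumes a secret key loaded from secure storage; `pseudonymize` and the demo key are illustrative.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes, prefix: str = "USER") -> str:
    """Return a stable pseudonym that cannot be recomputed without the key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:16]}"


# Placeholder key for demonstration only; load a real key from a secrets manager.
demo_key = b"replace-with-a-randomly-generated-secret"
token_a = pseudonymize("555-0100", demo_key)
token_b = pseudonymize("555-0100", demo_key)           # same key, same token
token_c = pseudonymize("555-0100", b"a-different-key")  # different key, different token
```

Because the token depends on the key, rotating the key changes every pseudonym, so datasets tokenized under different keys can no longer be joined.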

Local Processing

For maximum privacy, process sensitive data locally without sending it to external APIs:

    from transformers import pipeline

    class LocalTextProcessor:
        def __init__(self):
            # Load small models for local inference
            self.classifier = pipeline(
                "text-classification",
                model="distilbert-base-uncased-finetuned-sst-2-english",
                device=-1,  # CPU
            )
            self.ner = pipeline(
                "ner",
                model="dslim/bert-base-NER",
                device=-1,
            )
            self.summarizer = pipeline(
                "summarization",
                model="facebook/bart-large-cnn",
                device=-1,
            )

        def process_sensitive_data(self, text: str, task: str) -> dict:
            # All processing happens locally; nothing leaves this machine
            if task == "classify":
                return {"label": self.classifier(text)[0]["label"]}
            elif task == "extract_entities":
                return {"entities": self.ner(text)}
            elif task == "summarize":
                return {"summary": self.summarizer(text, max_length=130, min_length=30)[0]["summary_text"]}
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unsupported task: {task}")

Hybrid Approach

For complex tasks requiring powerful LLMs, strip PII before sending, then re-integrate after:

    def safe_llm_call(user_text: str) -> str:
        # Step 1: Detect and redact PII
        pii_entities = detect_pii(user_text)
        redacted_text = redact_pii(user_text)

        # Store PII mapping for later restoration. It stays in local memory
        # and is never sent to the LLM; this function does not use it further.
        pii_map = {
            entity["text"]: f"[{entity['entity']}_{i}]"
            for i, entity in enumerate(pii_entities)
        }

        # Step 2: Send redacted text to LLM
        # (call_llm stands for your own LLM client wrapper; it is not defined here)
        safe_prompt = f"Process this text: {redacted_text}"
        response = call_llm(safe_prompt)

        # Step 3: The response should use placeholders, not real data.
        # No restoration needed if the LLM just processes the structure.
        return response

For cases where PII must be restored:

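    def process_and_restore(user_text: str, context: dict) -> str:
        # redact_with_map is an assumed helper that returns the redacted text
        # plus a {placeholder: original_value} mapping
        redacted, pii_map = redact_with_map(user_text)
        result = call_llm(f"Based on this data: {redacted}, generate a response.")

        # The LLM response should reference PII only through placeholders;
        # swap the original values back in locally
        for placeholder, original in pii_map.items():
            result = result.replace(placeholder, original)
        return result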
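`redact_with_map` is not defined above. One possible shape, sketched here with standard-library regexes instead of Presidio, replaces each match with a numbered placeholder and records the placeholder-to-value mapping so the response can be restored locally. The function names, patterns, and placeholder format are all assumptions.

```python
import re

# Illustrative patterns; a real system would plug in a full PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact_with_map(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with numbered placeholders.

    Returns the redacted text and a {placeholder: original} mapping.
    """
    pii_map: dict[str, str] = {}
    seen: dict[str, str] = {}  # original value -> placeholder, so repeats share a token

    for label, pattern in PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            original = match.group()
            if original not in seen:
                placeholder = f"[{label}_{len(pii_map)}]"
                seen[original] = placeholder
                pii_map[placeholder] = original
            return seen[original]

        text = pattern.sub(substitute, text)
    return text, pii_map


def restore(text: str, pii_map: dict[str, str]) -> str:
    """Swap placeholders in an LLM response back to the original values."""
    for placeholder, original in pii_map.items():
        text = text.replace(placeholder, original)
    return text
```

Only the redacted text should be sent to the LLM. The mapping stays in process memory and should not be logged or persisted alongside the prompt.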

Compliance Checklist

    data_privacy_audit:
      - pre_processing:

Conclusion

AI data privacy requires proactive protection rather than reactive compliance. Detect and redact PII before any external API call. Use local models for sensitive processing when possible. Implement a hybrid approach for complex tasks: strip PII before cloud LLM inference, and never log raw user data. Regular privacy audits ensure that your protection measures stay effective as your application evolves.