Detecting and Redacting PII: Microsoft Presidio & Scrubadub


Jun 05, 2025 | By Ethan Coulthard

Introduction

Protecting personally identifiable information (PII) is essential. PII includes any details that can identify an individual, such as names, emails, phone numbers, and IDs. Accidental exposure can lead to identity theft, legal fines, and reputational damage. Open-source PII detectors like Microsoft Presidio and Scrubadub help organizations find and remove sensitive data before it’s logged, shared, or analyzed. This post gives a concise overview of each tool’s main features, use cases, and role in compliance, with a short demo of each.

Why PII Detection Matters

Regulations such as GDPR, HIPAA, and CCPA require organizations to minimize and protect personal data. By automatically scanning text—and even documents—PII detectors reduce risk by masking or replacing sensitive tokens. This safeguards user privacy, cuts down on manual review, and builds trust with customers and partners.

Microsoft Presidio

Microsoft Presidio is a flexible framework built for advanced PII detection and anonymization. It combines machine learning and regex-based approaches in a modular pipeline.

Key Features

  • Modular Pipeline
    • Separate components for detection (“recognizers”) and redaction/anonymization (“anonymizers”), all configurable through JSON or Python.
  • Mix of ML & Regex
    • ML models (via spaCy or transformers) catch context-sensitive PII.
    • Regex patterns cover well-known formats (emails, SSNs, phone numbers).
    • You can add custom recognizers for domain-specific IDs.
  • Anonymization Options
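    • Redact (remove the detected text entirely) or Mask (overwrite characters with a symbol such as “*”).
    • Replace (substitute a placeholder or a value you supply; realistic fake data requires a custom operator, for example one backed by a fake-data generator).
    • Hash (irreversibly transform PII, e.g. for logging) or Encrypt (reversible with a key).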
  • Deployment Flexibility
    • Use Presidio as a Python library, a Dockerized microservice, or even as an Azure Function.
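
To make the recognizer/anonymizer split concrete, here is a minimal standard-library sketch of the same architecture. It is not Presidio's API: the `Finding` class, the `PATTERNS` and `OPERATORS` tables, and both regexes are hypothetical and deliberately simplified, and a real deployment would need stricter patterns.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """One detected span: what it is, where it is, and how confident we are."""
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical regex "recognizers": entity type -> (pattern, fixed confidence).
PATTERNS = {
    "EMAIL": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), 0.9),
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.85),
}


def detect(text: str) -> list[Finding]:
    """Detection stage: run every pattern and collect the matching spans."""
    findings = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(Finding(entity_type, match.start(), match.end(), score))
    return findings


def mask(value: str) -> str:
    """Overwrite every character with an asterisk."""
    return "*" * len(value)


def hash_value(value: str) -> str:
    """Irreversibly transform the value with SHA-256."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical "anonymizer" configuration: which operator handles which entity.
OPERATORS: dict[str, Callable[[str], str]] = {
    "EMAIL": hash_value,
    "US_SSN": mask,
}


def anonymize(text: str, findings: list[Finding]) -> str:
    """Anonymization stage: rewrite spans from the end so offsets stay valid."""
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        original = text[f.start:f.end]
        operator = OPERATORS.get(f.entity_type)
        # Fall back to a type placeholder when no operator is configured.
        replacement = operator(original) if operator else f"<{f.entity_type}>"
        text = text[:f.start] + replacement + text[f.end:]
    return text


sample = "Contact jane@example.com, SSN 456-78-9012."
result = anonymize(sample, detect(sample))
print(result)
```

In this sketch the SSN is overwritten with asterisks and the email is replaced by its SHA-256 digest. Keeping detection and anonymization separate means either side can be changed without touching the other, which is the main appeal of the modular design.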

Common Use Cases for Presidio

  • Log & Transcript Redaction: Automatically remove names, emails, and IPs before storing logs or transcripts.
  • Data Analytics Prep: Strip PII from large text corpora (customer feedback, surveys) so you can analyze data safely.
  • Document Sanitization: Clean up contracts, reports, and medical records before sharing externally.
     

Presidio Demo

The demo will be done with the following text: "I’m applying for a mortgage and they asked for my full name (Sarah T. Dawson), SSN (456-78-9012), birth date (12/22/1990), address (245 Sunset Blvd, Miami, FL), and employment details." 

As you can see from the image, Presidio found two locations to redact, as well as a person's name and a United States Social Security number. The recognizers can be customized to redact more of the text or to leave more of it intact.
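
For readers who want to reproduce the demo, the sketch below runs the same text through Presidio's Python packages. It assumes `presidio-analyzer` and `presidio-anonymizer` are installed along with the default spaCy English model; the function name `redact_with_presidio` and the choice of operators are illustrative, not the configuration behind the screenshot.

```python
from typing import Optional

DEMO_TEXT = (
    "I’m applying for a mortgage and they asked for my full name "
    "(Sarah T. Dawson), SSN (456-78-9012), birth date (12/22/1990), "
    "address (245 Sunset Blvd, Miami, FL), and employment details."
)


def redact_with_presidio(text: str, score_threshold: Optional[float] = None) -> str:
    """Detect PII with Presidio's default recognizers and anonymize each hit."""
    # Imported lazily so this module still loads where Presidio is not installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig

    # Detection: the default engine combines NER with pattern recognizers.
    analyzer = AnalyzerEngine()
    findings = analyzer.analyze(
        text=text, language="en", score_threshold=score_threshold
    )

    # Anonymization: mask SSNs, replace everything else with a fixed token.
    anonymizer = AnonymizerEngine()
    result = anonymizer.anonymize(
        text=text,
        analyzer_results=findings,
        operators={
            "US_SSN": OperatorConfig(
                "mask",
                {"masking_char": "*", "chars_to_mask": 11, "from_end": False},
            ),
            "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"}),
        },
    )
    return result.text
```

Calling `redact_with_presidio(DEMO_TEXT)` returns the anonymized string. Which entities are actually caught depends on the installed Presidio version and language model, so treat the screenshot above as one possible result rather than a guaranteed one.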

Scrubadub

Scrubadub is a lightweight Python library focused on easy PII removal via rule-based “cleaners” (Scrubadub’s own documentation calls these detectors, and the PII they find “filth”).

Key Features

  • Cleaner Plugins
    • Each “cleaner” targets a specific PII type (e.g., emails, names, phone numbers). You can enable only the cleaners you need.
  • Simple, Regex-Driven
    • Relies mostly on regular expressions, so installation is straightforward and runtime overhead is minimal.
  • Custom Cleaners
    • If your industry has unique identifiers (e.g., medical record numbers), you can write and register a new cleaner.
  • Placeholder Substitutions
    • Defaults to replacing detected PII with tags like {{EMAIL}} or {{PHONE}}, but you can customize tokens or generate fake values.

Common Use Cases for Scrubadub

  • Email & Name Redaction: Quickly clean forum posts, chat logs, or survey responses.
  • Data Sharing Prep: Before sharing datasets, strip out obvious PII so downstream users only see anonymized text.
  • Content Moderation: Integrate into workflows to catch accidental PII leaks in user-generated content.

Scrubadub Demo

The demo will be done with the following text: "I’m applying for a mortgage and they asked for my full name (Sarah T. Dawson), SSN (456-78-9012), birth date (12/22/1990), address (245 Sunset Blvd, Miami, FL), and employment details." 

As the image shows, by default Scrubadub redacts only the SSN. The cleaner plugins mentioned above can expand Scrubadub's coverage. Scrubadub is also extremely lightweight, as the screenshot shows.
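
A rough equivalent of this demo in code might look like the following. It assumes the `scrubadub` package is installed and uses only its default detectors; the helper name `scrub_with_report` is hypothetical.

```python
def scrub_with_report(text: str):
    """Clean text with Scrubadub's defaults and list what was detected."""
    # Imported lazily so this module still loads where Scrubadub is not installed.
    import scrubadub

    scrubber = scrubadub.Scrubber()
    # Each detected item exposes its type and its begin/end offsets.
    found = [(filth.type, filth.beg, filth.end) for filth in scrubber.iter_filth(text)]
    cleaned = scrubber.clean(text)
    return cleaned, found
```

Passing the demo text to `scrub_with_report` returns the cleaned string plus a list of what was detected, which makes it easy to see which categories the defaults cover before you add more detectors. The exact results depend on the installed Scrubadub version and locale settings.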

Compliance Benefits

Both tools can help you align with privacy regulations by ensuring PII is removed before it’s logged, stored, or shared:

GDPR (EU)
Emphasizes data minimization. Redacting PII from analytics or logs helps you process only the data you need.

HIPAA (US Healthcare)
Protects health-related PII (PHI). Integrating a detector into your EHR pipeline or research dataset workflow reduces risk of accidental PHI exposure.

CCPA/CPRA (California)
Requires businesses to delete or anonymize user data on request. Automated detection makes it easier to locate and remove specific personal information.

By customizing recognizers or cleaners to match the PII categories you care about—names, phone numbers, account IDs—you reduce audit scope and lower the chance of non-compliance.

Best Practices

  • Start Small
    • Pilot PII detection on a subset of logs or documents to tune settings and catch false positives/negatives.
  • Tune Detection
    • Presidio: Adjust confidence thresholds on ML recognizers to balance missed cases versus over-masking.
    • Scrubadub: Refine regex patterns or add new cleaners for domain-specific identifiers.
  • Combine Approaches
    • For maximum coverage, run Scrubadub’s regex cleaners first, then Presidio’s ML pipeline (or vice versa), catching both obvious patterns and context-based PII.
  • Monitor Performance
    • Presidio: ML-based detection can be resource-intensive—consider batch processing or microservices behind a queue.
    • Scrubadub: Regex-heavy scans on large text can also be CPU-intensive; profile for your volume.
  • Audit & Logging
    • Keep a record of what PII was detected (type, context) for compliance reporting and potential post-incident analysis.
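
Several of these practices can be sketched together in a few lines: merge findings from two tools, drop those below a confidence threshold, and keep an audit record. Everything here is a standard-library illustration under stated assumptions. The `Finding` class, the scores, and the example spans are made up, and neither tool's real output format is used.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    entity_type: str
    start: int
    end: int
    score: float
    source: str  # which tool reported the span, e.g. "regex" or "ml"


def merge_findings(findings, threshold):
    """Drop low-confidence findings, then keep the higher score on overlap."""
    candidates = sorted(
        (f for f in findings if f.score >= threshold),
        key=lambda f: (f.start, -f.score),
    )
    kept = []
    for f in candidates:
        if kept and f.start < kept[-1].end:  # overlaps the previous span
            if f.score > kept[-1].score:
                kept[-1] = f
            continue
        kept.append(f)
    return kept


def audit_record(f):
    """Record type, position, and origin, but never the raw PII value."""
    return {
        "type": f.entity_type,
        "start": f.start,
        "end": f.end,
        "score": f.score,
        "source": f.source,
    }


text = "Email jane@example.com or call Jane."
findings = [
    Finding("EMAIL", 6, 22, 0.9, "regex"),
    Finding("PERSON", 6, 10, 0.6, "ml"),     # overlaps the email span
    Finding("PERSON", 31, 35, 0.85, "ml"),
    Finding("LOCATION", 0, 5, 0.2, "ml"),    # below the threshold
]

merged = merge_findings(findings, threshold=0.5)
records = [audit_record(f) for f in merged]
print(json.dumps(records, indent=2))
```

Raising the threshold reduces over-masking at the cost of more missed cases, so it is worth tuning on a pilot dataset. Leaving the raw value out of the audit record is a design choice in this sketch that keeps the audit log from becoming a second store of PII.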

Conclusion

Protecting PII is a must-have in any data-driven organization. Microsoft Presidio offers a powerful, extensible pipeline combining ML and regex for high-accuracy detection, while Scrubadub provides a simple, lightweight way to strip sensitive tokens via rule-based cleaners. Both tools ease the path to compliance, whether you’re subject to GDPR, HIPAA, or CCPA, by automating the otherwise manual task of finding and redacting PII. If your company is struggling to meet compliance requirements, TechHorizon Consulting can help. With our vCISO service, we can help your company reach compliance, whether you have to follow GDPR, HIPAA, NCUA, or CCPA. If this interests you, please visit our "Contact Us" page.