Enterprise PHI De-Identification Across eClinicalWorks & AthenaOne EMRs
About the Client
A multi-clinic neurology and ambulatory care network operating in a HIPAA-regulated environment. The organization manages high-volume structured and unstructured patient data across two major EMR platforms:
- eClinicalWorks (eCW)
- athenaOne
The network required secure data sharing capabilities to support research and advanced analytics initiatives without compromising patient privacy or regulatory compliance.
Problem Overview
The client faced a combined regulatory and technical challenge: producing research-ready datasets while fully complying with the Health Insurance Portability and Accountability Act (HIPAA).
Scale & Complexity
eCW Environment
- 900+ tables
- 20,000+ columns
- 7M+ patient records
AthenaOne Environment
- 450+ tables
- 8,000+ columns
- 2M+ patient records
Across both systems:
- 28,000+ columns processed
- Deeply embedded PHI in XML, HTML, and relational fields
- Narrative PHI in clinical notes, progress notes, family history, communications
The key challenge was not only masking direct identifiers but also detecting and removing contextual and inferred PHI embedded within unstructured data, all without degrading research utility.
Strategic Requirements
- 100% PHI removal or anonymization
- Preservation of longitudinal patient timelines
- Maintenance of relational integrity across 1,350+ heterogeneous tables
- Re-linkable pseudonymization for research consistency
- Auditability and compliance traceability
What Santeware Built
1. Comprehensive PHI Identification Framework
PHI was classified into three risk tiers:
| Risk Level | Category | Approach |
| --- | --- | --- |
| Level 1 | Direct Identifiers (Name, Phone, DOB, SSN) | Masked or pseudonymized |
| Level 2 | Contextual Identifiers (Location references, provider names) | NLP detection + replacement |
| Level 3 | Inferred Risk (Pattern-based identity clues) | Heuristic and pattern validation |
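As a rough illustration of how the three tiers might be applied during column triage, the sketch below tags column names with a tier based on keyword hints. The keyword lists, the `RiskTier` enum, and `classify_column` are hypothetical names introduced here, not part of the delivered framework. Level 3 is deliberately not inferred from names, since inferred risk depends on content-level pattern analysis.

```python
from enum import Enum
from typing import Optional


class RiskTier(Enum):
    DIRECT = 1       # Level 1: direct identifiers
    CONTEXTUAL = 2   # Level 2: contextual identifiers
    INFERRED = 3     # Level 3: needs content analysis, never assigned by name


# Hypothetical keyword hints drawn from the examples in the tier table.
# Contextual hints are checked first so that "provider_name" lands in
# Level 2 (provider names) rather than Level 1 (patient names).
TIER_HINTS = {
    RiskTier.CONTEXTUAL: ("location", "provider"),
    RiskTier.DIRECT: ("name", "phone", "dob", "ssn"),
}


def classify_column(column_name: str) -> Optional[RiskTier]:
    """Return a tier guess from the column name, or None if no hint matches."""
    lowered = column_name.lower()
    for tier, hints in TIER_HINTS.items():
        if any(hint in lowered for hint in hints):
            return tier
    return None
```

A name-based guess like this would only be a first pass; columns that return `None` would still need content-level review before being treated as free of PHI.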
2. Unstructured Data Processing Engine
A major complexity was deeply embedded PHI in narrative fields.
Santeware implemented:
- Stored-structure cross-reference validation
- Clinical-domain NLP model for PHI detection
- Regex-based pattern recognition
- Redaction and contextual replacement logic
Example transformation:
“Mrs. Jane Doe was admitted on 12/04/2023…” → “<> was admitted on <>…”
This ensured privacy protection while preserving semantic coherence.
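The regex layer of such a pipeline can be sketched as follows. This is a minimal example under stated assumptions: the patterns, the `[NAME]`, `[PHONE]`, and `[DATE]` placeholder tokens, and the `redact` function are illustrative, and regexes alone would miss names without an honorific, which is where the NLP model and cross-reference validation would come in.

```python
import re

# Illustrative patterns only; real clinical text needs far broader coverage.
NAME_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

# Applied in order; each match is replaced with a typed placeholder so the
# sentence stays readable after redaction.
REPLACEMENTS = (
    (NAME_RE, "[NAME]"),
    (PHONE_RE, "[PHONE]"),
    (DATE_RE, "[DATE]"),
)


def redact(text: str) -> str:
    """Replace pattern-matched PHI with placeholder tokens."""
    for pattern, token in REPLACEMENTS:
        text = pattern.sub(token, text)
    return text


print(redact("Mrs. Jane Doe was admitted on 12/04/2023."))
```

Typed placeholders, rather than blanked-out spans, are one way to keep the redacted note semantically coherent for downstream readers and models.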
3. Patient-Specific Date Offset Logic
To maintain longitudinal research integrity:
- Each patient received a unique randomized date offset (±30–38 days)
- Internal event order preserved
- Master audit offset registry maintained in a secured environment
This allowed time-series research without revealing actual encounter dates.
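The date-shifting idea can be sketched in a few lines. The `OffsetRegistry` class below is a hypothetical illustration, and it assumes that "±30–38 days" means a magnitude between 30 and 38 days in either direction. Because every date for a given patient moves by the same amount, intervals and event order are unchanged.

```python
import random
from datetime import date, timedelta


class OffsetRegistry:
    """Keeps one fixed day offset per patient (hypothetical sketch)."""

    def __init__(self, rng=None):
        # SystemRandom by default; a seeded random.Random can be injected
        # for reproducible tests.
        self._rng = rng if rng is not None else random.SystemRandom()
        self._offsets = {}

    def offset_for(self, patient_id: str) -> int:
        # Assumption: magnitude of 30 to 38 days, sign chosen at random.
        if patient_id not in self._offsets:
            magnitude = self._rng.randint(30, 38)
            sign = self._rng.choice((-1, 1))
            self._offsets[patient_id] = sign * magnitude
        return self._offsets[patient_id]

    def shift(self, patient_id: str, original: date) -> date:
        """Shift a date by the patient's stored offset."""
        return original + timedelta(days=self.offset_for(patient_id))
```

In this sketch the `_offsets` mapping plays the role of the audit offset registry, so it would need to live in the secured environment rather than alongside the released data.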
4. Structured Data Masking at Scale
Santeware implemented:
- Schema crawler to automate mapping across 1,350+ tables
- ID re-generation logic for patient, encounter, insurance, and appointment identifiers
- ZIP truncation and geographic generalization
- Consistency engine to ensure identical pseudonyms across all tables
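One common way to get identical pseudonyms across tables is a keyed hash, sketched below together with ZIP truncation. This is an assumption about technique, not a description of the delivered engine: `pseudonymize`, `truncate_zip`, the `PT` prefix, and the `low_population_prefixes` parameter are hypothetical names. The ZIP rule follows the HIPAA Safe Harbor convention of keeping the first three digits and using `000` for sparsely populated prefixes.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, prefix: str = "PT") -> str:
    """Derive a stable pseudonym from an identifier with HMAC-SHA256.

    The same value and key always yield the same pseudonym, so joins across
    tables keep working. The key must stay in the secured environment.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:16]}"


def truncate_zip(zip_code: str, low_population_prefixes=frozenset()) -> str:
    """Keep the first three ZIP digits, or '000' for restricted prefixes.

    low_population_prefixes is a caller-supplied set of three-digit prefixes
    that must be suppressed; it is not populated here.
    """
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        return "000"  # malformed input is suppressed rather than guessed
    prefix = digits[:3]
    return "000" if prefix in low_population_prefixes else prefix
```

A keyed hash avoids storing a lookup table for consistency, although a mapping registry is still needed if authorized re-linking back to source identifiers is required.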
550+ tables were rendered research-ready, including:
- Patient demographics
- Encounters
- Insurance
- Clinical notes
- Labs and vitals
- Communication logs
- Family history
5. Master De-Identification Engine Components
- Mapping Engine: Generated anonymized identities
- Consistency Engine: Ensured cross-table replacement accuracy
- Audit Logging Framework: Maintained before-after traceability
- Schema Integrity Validator: Preserved foreign keys and relational joins
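To show what a schema integrity check might look like after identifiers are regenerated, the sketch below looks for child rows whose foreign key no longer matches any parent row. It uses an in-memory SQLite database purely for illustration; the table names, column names, and the `find_orphans` helper are hypothetical, and the actual EMR databases would use their own engines and schemas.

```python
import sqlite3


def find_orphans(conn, child_table, child_fk, parent_table, parent_pk):
    """Return child foreign-key values that have no matching parent row.

    Identifiers are interpolated into the SQL, so they must come from a
    trusted schema catalog, never from user input.
    """
    query = (
        f'SELECT c."{child_fk}" FROM "{child_table}" AS c '
        f'LEFT JOIN "{parent_table}" AS p '
        f'ON c."{child_fk}" = p."{parent_pk}" '
        f'WHERE c."{child_fk}" IS NOT NULL AND p."{parent_pk}" IS NULL'
    )
    return [row[0] for row in conn.execute(query)]


def build_demo_db():
    """Create a tiny de-identified dataset with one deliberately broken link."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient (patient_id TEXT PRIMARY KEY);
        CREATE TABLE encounter (encounter_id TEXT PRIMARY KEY, patient_id TEXT);
        INSERT INTO patient VALUES ('PT-a1'), ('PT-b2');
        INSERT INTO encounter VALUES
            ('E1', 'PT-a1'), ('E2', 'PT-b2'), ('E3', 'PT-zz');
        """
    )
    return conn
```

Running `find_orphans(build_demo_db(), "encounter", "patient_id", "patient", "patient_id")` surfaces the encounter whose pseudonymized patient ID was not mapped consistently.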
Validation & Quality Assurance Framework
Santeware implemented a multi-layer validation pipeline:
- Regex-based PHI residue detection
- Inverse QA to detect false negatives
- Manual audits across 20+ patient datasets
- Clinical coherence validation
- Schema integrity verification
This ensured both compliance and research usability.
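The regex-based residue check in the first validation layer could look roughly like the sketch below, which scans already de-identified rows for strings that still resemble identifiers. The pattern set, `RESIDUE_PATTERNS`, and `scan_rows` are illustrative assumptions; a real scan would cover many more identifier shapes and feed its hits into manual review.

```python
import re

# Illustrative residue patterns; not an exhaustive PHI catalogue.
RESIDUE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_rows(rows):
    """Return (row_index, column, pattern_label) for every suspected leak.

    rows is an iterable of dicts mapping column names to values; non-string
    values are skipped.
    """
    findings = []
    for row_index, row in enumerate(rows):
        for column, value in row.items():
            if not isinstance(value, str):
                continue
            for label, pattern in RESIDUE_PATTERNS.items():
                if pattern.search(value):
                    findings.append((row_index, column, label))
    return findings
```

An empty result from a scan like this is evidence of absence only for the patterns it knows about, which is why it would sit alongside inverse QA and manual audits rather than replace them.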
The Impact
The initiative successfully transformed high-risk clinical datasets into research-compliant assets.
- 100% PHI masked or removed
- 550+ tables de-identified and research-ready
- 28,000+ columns processed across dual EMR systems
- Longitudinal timelines preserved
- Master audit sheet generated for every patient
- Cross-EMR PHI alignment achieved
- Fully compliant with HIPAA and internal security controls
The resulting datasets enabled secure analytics, AI research initiatives, and advanced reporting without exposing sensitive patient information.
Outcome
Santeware executed enterprise-scale PHI de-identification across two major EMR systems with high PHI density, complex relational dependencies, and unstructured clinical narratives. By combining schema automation, NLP-driven redaction, patient-level temporal logic, and rigorous validation controls, we delivered a secure, compliant, and analytically robust dataset. This framework now serves as a scalable model for large healthcare organizations seeking secure research enablement without regulatory exposure.