5 min read·Lesson 6 of 10

Data Protection and DLP

Classifying data, preventing leaks, and complying with regulation. DSPM, DLP, tokenisation, masking, and the practical patterns for sensitive data in cloud.

Most cloud breaches that make the news are data breaches — sensitive records exposed in object storage, exfiltrated from databases, or leaked through misconfigured backups. Data protection is the work of knowing what you have, where it lives, and what controls apply.

Data Classification

Pick a small, simple scheme and use it everywhere:

Class | Examples | Controls
Public | Marketing site, public blog | Standard hardening
Internal | Internal docs, non-PII analytics | Access controls, no public exposure
Confidential | Customer PII, business data | Encryption at rest, restricted IAM, audit logging
Restricted | Payment data, health records, secrets | Tokenisation / field-level encryption, dedicated networks, regulated handling

Tag every resource (bucket, table, queue) with its classification. Many policies and detections key off these tags.
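
As a minimal sketch of how a policy might key off these tags, the check below maps the four classes to two of the controls from the table. The tag key `data-classification`, the `Resource` fields, and the specific rules are assumptions made for illustration, not any provider's API.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class DataClass(IntEnum):
    """The four classes from the table above, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass
class Resource:
    # Hypothetical inventory record; a real one would come from a cloud API.
    name: str
    tags: dict = field(default_factory=dict)
    encrypted_at_rest: bool = False
    publicly_readable: bool = False


def check_resource(resource: Resource) -> list[str]:
    """Return a list of policy violations for one resource."""
    raw = resource.tags.get("data-classification")  # assumed tag key
    if raw is None:
        return ["missing data-classification tag"]
    try:
        level = DataClass[raw.upper()]
    except KeyError:
        return [f"unknown classification: {raw!r}"]

    problems = []
    if level >= DataClass.INTERNAL and resource.publicly_readable:
        problems.append("non-public data is publicly readable")
    if level >= DataClass.CONFIDENTIAL and not resource.encrypted_at_rest:
        problems.append("confidential data is not encrypted at rest")
    return problems
```

Treating an untagged resource as a violation in its own right is a deliberate choice here: a missing tag means none of the tag-driven policies can apply.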

Data Discovery and DSPM

Data Security Posture Management (DSPM) tools scan your cloud to find sensitive data automatically.

  • Amazon Macie — scans S3 buckets for PII, credit-card numbers, secrets.
  • Microsoft Purview — discovery and classification across Azure, M365, multi-cloud.
  • Google Cloud DLP / Sensitive Data Protection — content inspection across BigQuery, GCS, Datastore.
  • Third-party DSPM — Wiz, BigID, Cyera, Sentra. Cross-cloud, deeper graph analysis.

Typical output: "this bucket contains 1.2M social security numbers and is readable by ten roles outside this account." That is the conversation that gets remediation prioritised.
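
The content-inspection step behind such a finding can be sketched roughly as follows. This is a simplified illustration, assuming US-style SSNs and card numbers of 13 to 19 digits; commercial DSPM tools use far richer detectors, and the regular expressions and function names here are invented for the example.

```python
import re
from collections import Counter

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, each optionally followed by a single space or hyphen.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: weeds out digit runs that are not card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str) -> Counter:
    """Count likely SSNs and card numbers in a blob of text."""
    findings = Counter()
    findings["ssn"] = len(SSN_RE.findall(text))
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_ok(digits):
            findings["card"] += 1
    return findings
```

The Luhn check is what keeps order numbers and timestamps from being reported as card numbers; pattern matching alone tends to produce too many false positives to be actionable.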

Data Loss Prevention (DLP)

DLP is enforcement: detecting and blocking sensitive data from going where it should not.

  • Egress scanning — proxies and gateways inspect outgoing traffic for sensitive patterns.
  • Email/chat DLP — Microsoft Purview DLP, Google Workspace DLP, Zscaler. Block emailing customer lists externally.
  • Database DLP — Imperva, Satori, Cyral. Mask sensitive columns in query results based on user role.
  • API gateway DLP — pattern detection at the edge to prevent SSN/CC leaking in responses.
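
To illustrate the API gateway case, here is a small sketch that masks SSN-shaped strings in a JSON-like response body before it leaves the service. The `redact` function and the choice to keep the last four digits are assumptions for the example; a real gateway would apply this inside its own plugin or filter mechanism.

```python
import re

# Capture the last four digits so they can be kept for support purposes.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def redact(value):
    """Recursively mask SSN-shaped strings in a JSON-like structure."""
    if isinstance(value, str):
        return SSN_RE.sub(r"***-**-\1", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value  # numbers, booleans, None pass through unchanged
```

For instance, `redact({"note": "SSN 123-45-6789"})` returns `{"note": "SSN ***-**-6789"}`. Redacting rather than blocking keeps the API usable while the upstream service that leaked the field is fixed.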

Storage Hygiene: Stopping the Classic Leak

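  • S3 Block Public Access and its equivalents (disabling anonymous blob access on Azure storage accounts, GCS public access prevention) — turn on at the account or organisation level.
  • Bucket policy explicit deny on non-VPC-endpoint access for sensitive data.
  • Versioning plus immutability (S3 Object Lock, Azure immutable blob storage, GCS retention policies) for ransomware resilience.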
  • Lifecycle rules to move and eventually delete data per retention policy.
  • Logging of every object access (S3 server access logs / CloudTrail data events / equivalent).
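
Example bucket policy that denies all S3 access to reports-prod unless the request arrives through the named VPC endpoint. Because it is an explicit deny on every principal, it also locks out console and administrator access from outside that endpoint, so test it before applying it broadly:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessOutsideVpcEndpoint",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::reports-prod", "arn:aws:s3:::reports-prod/*"],
      "Condition": {
        "StringNotEquals": { "aws:SourceVpce": "vpce-0abc123" }
      }
    }
  ]
}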
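
A deny statement like the one above only helps if it is actually present on every sensitive bucket. The sketch below shows one way an audit script might check a policy document that has already been fetched and parsed; `enforces_vpce_only` is a hypothetical helper, and it deliberately simplifies (it ignores the Action element and assumes the condition key is spelled exactly as shown).

```python
def enforces_vpce_only(policy: dict, bucket: str) -> bool:
    """True if some Deny statement restricts the bucket to VPC endpoints."""
    wanted = {f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"}
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    for stmt in statements:
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        condition = stmt.get("Condition", {}).get("StringNotEquals", {})
        if (
            stmt.get("Effect") == "Deny"
            and stmt.get("Principal") == "*"
            and wanted.issubset(resources)  # bucket and its objects both covered
            and "aws:SourceVpce" in condition
        ):
            return True
    return False
```

Requiring both the bucket ARN and the `/*` object ARN matters: a deny on only one of them leaves either listing or object reads unprotected.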

Field-Level Protection

For the most sensitive fields, encryption at rest is not enough: the storage layer decrypts transparently, so anyone who can read the row sees the plaintext. Three stronger approaches:

  • Field-level encryption — encrypt specific columns with separate keys; only services with that key can decrypt.
  • Tokenisation — replace the sensitive value with an opaque token. The original lives in a vault that very few services can reach. Common in PCI scope reduction (you keep tokens, your vault keeps PANs).
  • Format-preserving encryption — output looks like the input format (still 16 digits, and can be made to pass Luhn by recomputing the check digit) so legacy systems cope.
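
The tokenisation idea can be sketched in a few lines. This is an in-memory stand-in only, assuming a hypothetical `TokenVault` class; a real vault is a separate, hardened service with its own access controls, persistence, and audit logging.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a tokenisation vault (illustration only)."""

    def __init__(self):
        self._by_token = {}
        self._by_value = {}

    def tokenise(self, pan: str) -> str:
        # Reuse the existing token so the same PAN always maps to one token.
        if pan in self._by_value:
            return self._by_value[pan]
        # The token is random, so it cannot be reversed without the vault.
        token = "tok_" + secrets.token_hex(16)
        self._by_token[token] = pan
        self._by_value[pan] = token
        return token

    def detokenise(self, token: str) -> str:
        # In a real deployment only a few services would be allowed to call this.
        return self._by_token[token]
```

Downstream systems store and join on the token. Because the token is random rather than derived from the PAN, a leaked table of tokens reveals nothing without access to the vault, which is the property that takes those systems out of PCI scope.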

Masking and Anonymisation

  • Static masking — produce a sanitised copy of prod for non-prod use. Native support varies by platform, so this is usually a masking job run against the restored copy before anyone is given access to it.
  • Dynamic masking — at query time, return masked values to non-privileged users. Snowflake dynamic data masking, BigQuery column policy tags.
  • Tokenisation pipelines — replace sensitive fields before data reaches analytics warehouses.

Rule: production data should never appear in dev environments unmasked. Many data leaks involve developer or analyst access to data they did not actually need.
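
A static masking job can be as simple as a per-row transform applied while copying data out of production. The sketch below assumes hypothetical column names (`email`, `card_number`) and a secret masking key held outside the non-prod environment; it is an illustration of the pattern, not a complete anonymisation scheme.

```python
import hashlib
import hmac


def mask_card(pan: str) -> str:
    """Keep only the last four digits, as on a receipt."""
    digits = [c for c in pan if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def pseudonymise_email(email: str, key: bytes) -> str:
    """Deterministic stand-in so joins across tables still work after masking."""
    digest = hmac.new(key, email.lower().encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.invalid"


def mask_row(row: dict, key: bytes) -> dict:
    """Return a sanitised copy of one row; other columns pass through."""
    masked = dict(row)
    if "email" in masked:
        masked["email"] = pseudonymise_email(masked["email"], key)
    if "card_number" in masked:
        masked["card_number"] = mask_card(masked["card_number"])
    return masked
```

Using a keyed HMAC rather than a plain hash means someone in the dev environment cannot confirm a guessed email by hashing it themselves, while the same input still always produces the same pseudonym.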

Backups and Ransomware

  • Backups in a separate account / subscription / project — different IAM blast radius.
  • Object Lock / immutable backups so even an admin cannot delete them within the retention window.
  • Cross-region copies for disaster recovery.
  • Test restores quarterly. A backup you cannot restore is not a backup.
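
A restore test needs a pass/fail criterion, not just "the job finished". One simple option, sketched here under the assumption that both the source data and the restored copy are reachable as directory trees, is to compare content hashes file by file. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict:
    """Map each file's relative path to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(source: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing or differ after a restore."""
    expected, actual = tree_digest(source), tree_digest(restored)
    return sorted(path for path in expected if actual.get(path) != expected[path])
```

An empty list means every source file came back intact. For databases the equivalent check is usually row counts plus a few known-answer queries against the restored instance.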

Privacy and Regulation

Regulation | Key cloud-relevant rules
GDPR (EU) | Lawful basis, right to access/erasure, breach notification to the supervisory authority within 72h of becoming aware, data localisation considerations
HIPAA (US healthcare) | BAA with cloud provider; eligible services only; encryption is an "addressable" safeguard that is expected in practice
PCI DSS (cards) | Reduce scope via tokenisation; segment networks; quarterly vulnerability scans; audit logs retained for at least 12 months
SOC 2 | Controls mapped to the Trust Services Criteria: Security (always in scope), plus Availability, Processing Integrity, Confidentiality, Privacy as chosen
ISO 27001 | Information Security Management System; risk-based control selection
CCPA/CPRA, LGPD, PIPEDA | Regional privacy regimes with concepts similar to GDPR

Cloud providers maintain compliance for the platform; you are responsible for the configuration. Use compliance-mapped controls (AWS Audit Manager, Microsoft Purview Compliance Manager, GCP Assured Workloads) to track your side of that split.

Right to Erasure

GDPR's "right to be forgotten" forces architectural choices:

  • Centralise PII; do not scatter user IDs and emails across every microservice's database.
  • Build a deletion pipeline that propagates erasures across systems and is auditable.
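  • Backups usually cannot be edited record by record. The common approach is to let them expire on their normal schedule while making sure erased data cannot be brought back into use; encrypting each user's data with its own key and destroying that key on erasure is one solution.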
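
The deletion pipeline described above might be sketched as a registry of per-system erasers plus an audit trail. Everything named here (`erase_user`, the `Eraser` type, the audit fields) is an assumption for illustration; a production pipeline would also need retries, durable audit storage, and probably a hashed user reference in the log rather than the raw ID.

```python
from datetime import datetime, timezone
from typing import Callable

# Each eraser deletes one user's data from one system and returns the
# number of records removed. The registry passed in is hypothetical.
Eraser = Callable[[str], int]


def erase_user(user_id: str, erasers: dict[str, Eraser]) -> list[dict]:
    """Run every registered eraser and return an audit trail of the results."""
    audit = []
    for system, erase in erasers.items():
        entry = {
            "user_id": user_id,
            "system": system,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        try:
            entry["deleted"] = erase(user_id)
            entry["status"] = "ok"
        except Exception as exc:  # record the failure and keep going
            entry["status"] = "failed"
            entry["error"] = str(exc)
        audit.append(entry)
    return audit
```

Continuing past a failed system, rather than aborting, means one unavailable service does not block erasure everywhere else; the failed entries in the trail are what gets retried.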

Common Failure Modes

  • Public S3 / GCS / Blob bucket containing customer data.
  • Snapshot or backup shared with all accounts ("any AWS account can copy this snapshot").
  • Production database snapshot restored into staging without masking.
  • Logs containing tokens, JWTs, or credit-card numbers.
  • Accidentally sending PII to a third-party analytics or AI service in violation of policy.
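
The logging failure mode in particular is cheap to reduce at the source. As a sketch, the filter below uses Python's standard `logging.Filter` hook to scrub JWT-shaped strings before a handler writes them; the `RedactSecrets` class name and the regular expression are assumptions, and the same idea extends to card numbers or API keys.

```python
import logging
import re

# Three base64url segments separated by dots; JWT headers start with "eyJ".
JWT_RE = re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+")


class RedactSecrets(logging.Filter):
    """Scrub JWT-shaped strings from a record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with its args already applied
        record.msg = JWT_RE.sub("[REDACTED-JWT]", message)
        record.args = None  # args are now baked into msg
        return True  # keep the record, just with the secret removed
```

Attaching the filter to a handler (`handler.addFilter(RedactSecrets())`) rather than to a single logger covers every record that handler emits, including those from child loggers.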

The biggest defence is fewer copies, fewer places. Centralise, classify, and automate the controls. The next lessons cover detection — knowing when a control fails is as important as setting it.

Key Takeaways

  • You cannot protect what you have not classified — start with data discovery.
  • DLP and DSPM tools find sensitive data and tell you when it moves where it should not.
  • Tokenisation and field-level encryption keep sensitive values protected when a row is read, not just while it sits on disk.
  • Object storage misconfigurations are among the most common cloud data leak vectors.
  • GDPR, HIPAA, PCI DSS impose specific data-handling requirements that drive architectural choices.
