2018-11-25 06:22

Anonymize production data

Goal: local copies of production data must be anonymized to limit the damage of security breaches.

Check out this article to see why using production data on local development hardware is discouraged to begin with.

If you must run performance tests on a large production dataset, or you are debugging an issue that can’t be reproduced with test data, you are sometimes forced to download a copy of the customer’s data to your development machine.

To minimize the chance of data leaks (e.g. theft of your local development machine), you must run the anonymization script (on a copy of the database!) before you download the dataset.

This anonymization script must:

  • Randomize first and last names e.g. Jan de Boer becomes Lhx ct Rjow;
    • It’s advised to keep the same number of characters and spaces for the sake of "readability" (strings will still look a bit like names in this example).
    • It is advised to keep the type of character intact e.g. letters remain letters including casing, and digits remain digits to avoid validation errors or database constraint issues.
    • In case you really need readable first and last names, or street names, or any of the other fields, then it is OK to omit one (and just one) of the fields from the randomization process. In case of a data breach the attacker will only obtain one readable column. While this is (strictly speaking) still a data leak, it is not a major screw up since (most of) the data will be of a generic nature (people share the same name, the same street name occurs multiple times, etc).
  • Randomize street names and cities in the same way names are randomized e.g. Den Haag becomes Xkl Fyux.
  • Randomize phone numbers e.g. 06 46 277 984 becomes 06 00 000 000.
    • To avoid validation errors or database constraint issues it is advised to keep the number of digits intact. If your validations are very strict you may have to pick a couple of harmless real phone numbers.
  • Change IBAN and BIC into a limited set of test numbers.
  • Randomize unique usernames in such a way that collisions are avoided, e.g. jandeboer becomes a62b04902928b1db4d9cd6585f0b704f9279a13ad9332049b50742452d6ff225, which is the SHA-2 hash of 36363-jandeboer, where 36363- acts as a random salt value to minimize the chance of a successful rainbow table attack.
  • Randomize email addresses e.g. jan.de.boer@example.com becomes lhx.ct.rjow@example.com. If email addresses act as unique usernames in your system, you can use the SHA-2 hashing technique described for usernames instead of randomizing characters.
  • Instead of using letter & digit randomization you can also use one of the many fake data generator implementations. Such libraries contain hundreds of fictitious first and last names, city names, et cetera.
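
The character-preserving randomization and the salted hashing from the list above could be sketched as follows. This is a minimal Python sketch under stated assumptions, not a reference implementation: the function names are made up for illustration, SHA-256 is assumed as the SHA-2 variant, and non-ASCII letters are replaced by ASCII letters of the same case.

```python
import hashlib
import random
import secrets
import string
from typing import Optional


def randomize_preserving_shape(value: str, rng: Optional[random.Random] = None) -> str:
    """Replace letters with random letters (same case) and digits with random
    digits; spaces and punctuation stay where they are."""
    rng = rng or secrets.SystemRandom()
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep spaces, dots, dashes, etc.
    return "".join(out)


def randomize_email_local_part(address: str, rng: Optional[random.Random] = None) -> str:
    """Randomize only the part before the last '@', keeping the domain."""
    local, sep, domain = address.rpartition("@")
    if not sep:  # no '@' found: treat the whole value as opaque text
        return randomize_preserving_shape(address, rng)
    return randomize_preserving_shape(local, rng) + sep + domain


def hash_username(username: str, salt: str) -> str:
    """Return the SHA-256 hex digest of '<salt>-<username>' (assumed format)."""
    return hashlib.sha256(f"{salt}-{username}".encode("utf-8")).hexdigest()


if __name__ == "__main__":
    salt = secrets.token_hex(8)  # hypothetical: one random salt per anonymization run
    print(randomize_preserving_shape("Jan de Boer"))
    print(randomize_preserving_shape("06 46 277 984"))
    print(randomize_email_local_part("jan.de.boer@example.com"))
    print(hash_username("jandeboer", salt))
```

Because the salt is the same for every username within a run, distinct usernames keep distinct hashes, so a unique constraint on the column should still hold. Throw the salt away after the run; otherwise the hashes can be recomputed from guessed usernames.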

A script which anonymizes production data should:

  • Be updated regularly by a developer who is responsible for keeping the anonymization script in sync with database schema changes. To ensure this actually happens, it is advised to schedule a monthly review and require the developer to send a status report to their tech lead or other supervisor;
  • Be easy to run on staging and development hardware.

Tip: ensure the anonymization script cannot run on production databases by including a circuit breaker which detects e.g. an environment variable or some other signal only present in production environments.
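
Such a circuit breaker could look roughly like the Python sketch below. The variable name APP_ENV, the marker values and the exception class are assumptions made for illustration; use whatever signal is really only present in your production environment.

```python
import os
from typing import Mapping, Optional

# Hypothetical values that identify a production environment.
PRODUCTION_MARKERS = {"production", "prod"}


class ProductionEnvironmentError(RuntimeError):
    """Raised when the anonymization script is started in production."""


def assert_not_production(environ: Optional[Mapping[str, str]] = None) -> None:
    """Abort if the (assumed) APP_ENV variable marks this host as production."""
    environ = os.environ if environ is None else environ
    env_name = environ.get("APP_ENV", "").strip().lower()
    if env_name in PRODUCTION_MARKERS:
        raise ProductionEnvironmentError(
            "Refusing to anonymize: APP_ENV indicates a production environment."
        )


if __name__ == "__main__":
    assert_not_production()  # call this before opening any database connection
    print("Not a production environment, continuing with anonymization.")
```

Calling the check as the very first step means the script stops before it opens a database connection.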
