2018-11-24 19:33 — By Erik van Eykelen

Use Generated Test Data

Test data should be generated by code instead of relying on copies of production data.

As a developer or tester, it is important to work with a representative dataset in the apps you’re working on. Having no dataset, or only a very small one, makes it hard to test different scenarios and corner cases. A very large (production) database may slow things down, and because its content is constantly changing it becomes harder to quickly navigate to the data or screens you need to test.

Working with production data should be avoided because:

  • It’s probably against the law (GDPR).
  • It increases the likelihood of overwriting production data through human error (for example, uploading instead of downloading).
  • The size of production data may be larger than actually needed locally.
  • It increases the chance of leaking personal data (hardware theft, security breach).

A script which generates test data should:

  • Be updated regularly by a developer who is responsible for keeping the test data script in sync with database schema changes. To ensure this actually happens, schedule a monthly review and require the developer to send a status report to their tech lead or other supervisor.
  • Generate a broad range of content including:
    • Personas with names, phone numbers, email addresses.
    • Street addresses.
    • Profile pictures.
    • News items including photos.
    • Attachments such as PDFs.
    • IBAN/BIC numbers.
    • Strings including XSS payloads.
    • Unicode characters such as ë and emojis.
    • Odd data such as single character first and last names.
    • Strings with spaces prepended and appended.
    • Multi-line strings (e.g. news articles with line breaks).
    • Titles and captions with many characters (to test truncation and layout issues).
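
The odd-data items in the list above can be sketched as a small helper. This is only an illustration: the module name EdgeCaseData, its constants, and its methods are hypothetical and not part of any library, and the payloads are just examples of the kind of strings worth seeding.

```ruby
# Hypothetical helper for edge-case test data; all names are illustrative.
module EdgeCaseData
  # Strings that should always be rendered escaped, never executed.
  XSS_PAYLOADS = [
    "<script>alert('xss')</script>",
    "\"><img src=x onerror=alert(1)>"
  ].freeze

  # Names with diacritics and an emoji.
  UNICODE_NAMES = ["Zoë", "Jürgen", "José 🎉"].freeze

  # Single-character first or last names.
  SINGLE_CHAR_NAMES = %w[A O X].freeze

  POOL = (UNICODE_NAMES + SINGLE_CHAR_NAMES + XSS_PAYLOADS).freeze
  FILLER = "Lorem ipsum dolor sit amet ".freeze

  # Cycle through the pool so every seed run contains each edge case.
  def self.name_for(index)
    POOL[index % POOL.size]
  end

  # Surround a string with spaces to test trimming.
  def self.padded(string)
    "  #{string}  "
  end

  # Build a title of exactly `length` characters to test truncation.
  def self.long_title(length = 300)
    (FILLER * ((length / FILLER.size) + 1))[0, length]
  end

  # Build a multi-line body with a blank line between paragraphs.
  def self.multi_line(paragraphs = 3)
    Array.new(paragraphs) { |i| "Paragraph #{i + 1} of a generated news article." }
         .join("\n\n")
  end
end

puts EdgeCaseData.padded(EdgeCaseData.name_for(0)).inspect
puts EdgeCaseData.long_title(40)
```

Cycling through a fixed pool, rather than sampling from it, is one way to make sure each edge case shows up in every generated dataset.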

It should be easy to populate development, staging, or review app databases with generated data.

In the following example, 31 tables are populated with 4,887 records, some random and some predictable:

~/projects/some-rails-app>rails db:seed
Added 4 funds
Added 63 users
Added 11 companies
Added 11 company users
Added 21 ventures
Added 42 venture users
Added 84 educations
Added 126 experiences
Added 294 expertises
Added 168 achievements
Added 21 locations
Added 8 rounds
Added 1 deal types
Added 84 deals
Added 252 deal events
Added 441 watchlists
Added 441 early accesses
Added 63 participations
Added 252 documents
Added 189 assigned documents
Added 2 assets
Added 20 news items
Added 21 decks
Added 483 deck parts
Added 63 indicators
Added 588 indicator points
Added 84 deck bookmarks
Added 84 deck attachments
Added 63 messages
Added 21 venture updates
Added 882 notifications

It should be easy to edit existing test data or add new test data by using a “fake data” generator:

# Create 20 users for each of the three user types.
[User::USER_TYPE_INVESTOR, User::USER_TYPE_FOUNDER, User::USER_TYPE_ADMIN].each_with_index do |user_type, idx1|
  1.upto 20 do |idx2|
    first_name = Mockdata::People.first_name
    last_name = Mockdata::People.last_name
    # Assumes User is an ActiveRecord model; create! raises if validation fails.
    User.create!(
      user_type: user_type,
      # The two indexes keep email addresses unique when names repeat.
      email_address: "#{first_name}.#{last_name}-#{idx1}-#{idx2}@example.com".downcase,
      phone_number: "+31 646 000 000",
      first_name: first_name,
      last_name: last_name,
      password: "11223344",
      gender: [User::GENDER_MALE, User::GENDER_FEMALE, User::GENDER_OTHER].sample
    )
  end
end

See https://github.com/evaneykelen/mockdata for an example of a Ruby library which provides fake names of people, companies, and projects. There are similar libraries for C# and other languages.

Running your test data script should generate actual database records:

> ap User.first
  User Load (0.9ms)  SELECT  "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT $1  [["LIMIT", 1]]
#<User:0x00007fd7b5b648d0> {
                          :id => "0287ee7e-cae1-48d7-b093-64fd7840df46",
                   :user_type => "investor",
                      :gender => "male",
               :email_address => "investor@example.com",
                :phone_number => "+31646000000",
                      :prefix => nil,
                  :first_name => "Paul",
                       :infix => nil,
                   :last_name => "Graham",
                     :postfix => nil,
                 :website_url => nil,
                    :blog_url => nil,
                 :twitter_url => nil,
                :facebook_url => nil,
               :instagram_url => nil,
                :linkedin_url => nil,
             :password_digest => "$2a$10$nIWqjnHVTNaVuHixwiLCcOsA/GG54jkgW2TfBsC3wCaOfQdd5C3JW",
                    :timezone => "Amsterdam",
                   :signatory => false,
                    :asset_id => nil,
    :last_viewed_dashboard_at => nil,
             :last_invited_at => Fri, 02 Nov 2018 12:00:35 UTC +00:00,
             :svg_path_paraph => nil,
          :svg_path_signature => nil,
                  :created_at => Fri, 02 Nov 2018 12:00:20 UTC +00:00,
                  :updated_at => Fri, 02 Nov 2018 12:00:35 UTC +00:00,
                       :brand => "uplane"
}

The code and output snippets above are merely examples. Test data can be generated in any programming language, and the generator’s UI can be a CLI or a GUI.

Tip: ensure the test data script cannot run on production databases by including a circuit breaker which detects e.g. an environment variable or some other signal only present in production environments.
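
A minimal sketch of such a circuit breaker is shown here. It assumes production is identified by an environment variable set to "production". RAILS_ENV and RACK_ENV are the conventional Rails and Rack variables, while APP_ENV, the error class, and the method names are hypothetical choices for this example.

```ruby
# Hypothetical guard for a test data script; names are illustrative.
class ProductionSeedError < StandardError; end

# True when any of the checked variables says "production".
# `env` defaults to the process environment and can be replaced in tests.
def production_like?(env = ENV)
  %w[RAILS_ENV RACK_ENV APP_ENV].any? do |key|
    env[key].to_s.downcase == "production"
  end
end

# Call this as the very first step of the script.
def guard_against_production!(env = ENV)
  return unless production_like?(env)

  raise ProductionSeedError, "Refusing to generate test data in production"
end

guard_against_production!({ "RAILS_ENV" => "development" })
puts "Safe to generate test data"
```

In a Rails app the same check could use Rails.env.production? instead. Checking more than one signal (for example, also the database host name) makes the guard harder to bypass by accident.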

Tip: although generated test data is mostly random, it is wise to limit the randomness to achieve a level of predictability. For instance it’s good to pick a random city name from a set of just 10 pre-defined names so that testers always know which names they can input in search fields.
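
One way to limit the randomness is to sample from a small fixed pool, optionally with a seeded random number generator. The sketch below is only an illustration: the CITIES list and the random_city method are hypothetical, and the ten city names are arbitrary.

```ruby
# Illustrative fixed pool of ten city names; the names are an assumption.
CITIES = [
  "Amsterdam", "Rotterdam", "Utrecht", "Eindhoven", "Groningen",
  "Tilburg", "Almere", "Breda", "Nijmegen", "Haarlem"
].freeze

# Pick a city from the fixed pool. Passing a seeded Random makes the
# sequence of picks repeatable from one run to the next.
def random_city(rng = Random.new)
  CITIES.sample(random: rng)
end

rng = Random.new(42)
puts Array.new(5) { random_city(rng) }.inspect
```

Testers can then search for any of the ten names and expect to find matching records, while the assignment of cities to individual records still varies.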
