How to Install and Use Replibyte to Assist with Database Development
Traducciones al EspañolEstamos traduciendo nuestros guías y tutoriales al Español. Es posible que usted esté viendo una traducción generada automáticamente. Estamos trabajando con traductores profesionales para verificar las traducciones de nuestro sitio web. Este proyecto es un trabajo en curso.
Database testing is a critical component of the quality-assurance cycle, but using production data is inherently insecure. Additionally, the sheer volume of data can be difficult to work with. Unfortunately, it is difficult to create realistic “fake” data, and the results might not be representative. Replibyte (also stylized as RepliByte) allows users to transform their production data and use the results to seed a test database. This guide explains how to install Replibyte and how to use it to transform a dataset.
What is Replibyte?
Replibyte transforms existing production data into seed data suitable for non-production environments including development, testing, and customer demos. It prevents unauthorized personnel, such as research and development engineers, from obtaining access to the live data. This anonymous dataset also ensures the integrity of the original data, preventing it from being accidentally altered and then redeployed at a later time.
The transformation process obscures any sensitive real-world details to enhance security while retaining the essential characteristics of the original data set. This ensures the resulting database is based on real-world data. The new database should be roughly equivalent to the production database in terms of the number of rows and distribution of data it contains.
To help users effectively deal with very large databases, Replibyte can subset the original data. This operation restricts the new datastore to a subsection of the original data. A smaller database is easier to work with, requires less bandwidth to transfer, and takes less time to search.
The complete Replibyte process from source to destination database follows the following steps:
- Replibyte accesses the source database and takes a full SQL dump of the data. The source database can either be on the same system or on a remote system.
- Replibyte reads and parses the data.
- (Optional) For some database types, Replibyte can scale the original data down to a fractional subset. This process shrinks the number of database entries to a certain percentage of the original. The subset operation is currently only supported on PostgreSQL.
- Replibyte transforms the original database records, changing or hiding the values of one or more columns. These operations can obfuscate or trim the data, randomize strings, or auto-generate completely new values.
- (Optional) Replibyte can compress the data to reduce storage requirements. It can also encrypt the modified data.
- The modified data is known as the dump data. Replibyte writes this dump data to a datastore. A datastore can either reside on the local system or inside cloud storage. Along with the modified data, Replibyte creates an index file enumerating the conversions.
- When requested, Replibyte retrieves the modified data and the index file from the datastore. It parses the index file and decrypts or decompresses it as required, restoring the dump data.
- Replibyte copies the data to the destination database. The user can access this database like any other.
Some of the features and advantages of Replibyte include the following:
- It is relatively easy to install and use. Replibyte is lightweight and stateless and does not require its own server or daemon.
- It supports MySQL/MariaDB, PostgreSQL, and MongoDB as the source/destination database for the backup and restore procedures. The Replibyte Database Documentation page provides full information.
- It can store a datastore on either a local disk or in the cloud, including inside a Linode Object Storage solution. See the Replibyte Datastore information for more details about the possible cloud options.
- It supports a full complement of transformers. It can randomize a string, keep the first character only, or obfuscate data. It can also auto-generate an email address, first name, phone number, or credit card number. Fields can also be left at their original values to enable specific tests. Users are permitted to create custom transformers.
- It can work on large databases containing over 10GB of data.
- For PostgreSQL only, a database subset feature allows users to limit the number of entries in a database. This feature is not yet supported on MySQL.
- It uses Zlib for compression and AES-256 for encryption.
Replibyte does have a couple of limitations. As of this writing, it is not possible to copy the contents of the datastore directly into a local database, only a remote database. The only exception to this rule is if the local database instance is running inside a Docker container. The second limitation is the source database containing the original data and the destination database must have the same type. For instance, a transformed copy of a MySQL database can only be copied into another MySQL or MariaDB database.
What are the Replibyte Transformer Types?
Each transformer operates on the data in a different way. In most cases, a transformer randomizes, redacts, or hides the original data. However, the transient transformer is a “no-op” that leaves the original data intact. The transformer types are as follows.
email: Generates a valid email address.first-name: Changes the existing value to a valid first name.phone-number: Creates a valid phone number. This can only be applied to a string field, not an integer.random: Randomizes a string, maintaining the same string length.keep-first-char: Trims a value to its first letter only.credit-card: Generates a credit card number in the correct format. This transformer can only act on strings, not integers.redacted: Hides the original data using the*symbol.transient: Leaves the original data unaltered. This must be applied to keys and to other columns that must remain legible.
Transformers are applied on a per-column or per-key basis. The same transformer acts on the same column for all records inside a given table. For more information on transformers, along with examples of how to use them, see the Replibyte transformer documentation.
Before You Begin
If you have not already done so, create a Linode account and Compute Instance. See our Getting Started with Linode and Creating a Compute Instance guides.
Follow our Setting Up and Securing a Compute Instance guide to update your system. You may also wish to set the timezone, configure your hostname, create a limited user account, and harden SSH access.
sudo. If you are not familiar with the sudo command, see the
Linux Users and Groups guide.How to Install Replibyte
Replibyte is fairly easy to install. Although Replibyte is available for Windows, Linux, and MacOS, these instructions are geared toward Ubuntu 22.04 LTS users. However, they are generally applicable to all Linux distributions. To install Replibyte, follow these steps.
Update the system. Reboot it if necessary.
sudo apt-get update -y && sudo apt-get upgrade -yInstall the
jqutility.sudo apt install jqUse
curlto download the Replibyte archive. This command runs in the background and prints a message to the shell when it is done.curl -s https://api.github.com/repos/Qovery/replibyte/releases/latest | jq -r '.assets[].browser_download_url' | grep -i 'linux-musl.tar.gz$' | wget -qi - &Extract the archive.
tar zxf *.tar.gzMake the Replibyte application executable.
chmod +x replibyteMove the program to a system directory.
sudo mv replibyte /usr/local/bin/Enter the
replibyte -Vcommand to view the release number and confirm the application is installed correctly.replibyte -Vreplibyte 0.10.0To display the Replibyte help information, enter the
replibytecommand without any arguments.replibyteReplibyte 0.10.0 Replibyte is a tool to seed your databases with your production data while keeping sensitive data safe, just pass `-h` USAGE: replibyte [OPTIONS] --config <configuration file> <SUBCOMMAND> OPTIONS: -c, --config <configuration file> Replibyte configuration file -h, --help Print help information -n, --no-telemetry disable telemetry -V, --version Print version information SUBCOMMANDS: dump all dump commands help Print this message or the help of the given subcommand(s) source all source commands transformer all transformer commands
How to Configure Replibyte Using a YAML File
Before populating a new database, Replibyte must create a datastore from the source database. The source database contains the original data that serves as a template for the new data. The datastore contains the modified data. After a datastore is created, it can be used to seed a new database.
The conf.yaml file describes the source database along with a list of transformations to apply to the data. The transformations hide or sanitize any sensitive data in the original file. The YAML file also specifies information about the datastore. The datastore can be located either in the cloud or on the local disk. The destination database does not have to be specified when the datastore is created. The destination configuration is often added later when the data is required.
Identify the database to be transformed into seed data. To properly identify the database, the user name, password, host, port, and database name are required. This guide uses the local MariaDB database
sourcedb. This database contains a table namedpatients, which has the following table description.use sourcedb; desc patients;+------------+-------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------+-------------+------+-----+---------+-------+ | userid | char(8) | YES | | NULL | | | first_name | varchar(20) | YES | | NULL | | | last_name | varchar(20) | YES | | NULL | | | phone | char(10) | YES | | NULL | | | email | varchar(30) | YES | | NULL | | | unit | varchar(20) | YES | | NULL | | | credit | char(16) | YES | | NULL | | | socialnum | char(9) | YES | | NULL | | +------------+-------------+------+-----+---------+-------+ 8 rows in set (0.001 sec)The
patientstable currently contains the following records.SELECT * FROM patients;+----------+------------+-----------+------------+------------------+---------+------------------+-----------+ | userid | first_name | last_name | phone | email | unit | credit | socialnum | +----------+------------+-----------+------------+------------------+---------+------------------+-----------+ | 13572468 | Bob | Jones | 1239876543 | bojones@isp1.com | Cardiac | 1122334455667788 | 222333444 | | 13572469 | Jack | Smith | 1239871234 | jacks@isp2.com | Neuro | 2232334344545565 | 343333666 | | 13572470 | John | Doe | 1234547359 | jjdoe43@isp4.com | Trauma | 3579468024683579 | 454454454 | +----------+------------+-----------+------------+------------------+---------+------------------+-----------+Create a new
conf.yamlfile on the local system.vi conf.yamlAdd the
sourceconfiguration, including the attributeconnection_uri. This value specifies the location of the source database. The value ofconnection_urimust follow the formatmysql://[user]:[password]@[host]:[port]/[database]. To access the local database, use127.0.0.1for thehost. For MariaDB or MySQL, the default port is3306. The example in this section uses theuseridaccount to access the source databasesourcedbfrom the local MariaDB application. Ensure the user has been granted access to the database. Substitute their user name foruseridand their actual password forpassword.Note To access a PostgreSQL or MongoDB database, the syntax is similar. For PostgreSQL, the correct syntax isconnection_uri: postgres://[user]:[password]@[host]:[port]/[database]. For MongoDB, use the formatconnection_uri: mongodb://[user]:[password]@[host]:[port]/[database].- File: conf.yaml
1 2source: connection_uri: mysql://userid:password@127.0.0.1:3306/sourcedb
Replibyte typically transforms at least some data fields. Inside the source section, add a
transformerskey, which accepts an array. The first value in the array uses thedatabasekey to indicate the database to transform. The second key is thetablekey. It contains the name of the table to modify. The following example illustrates the first section of thetransformerskey. It stipulates the transformations to apply to thepatientstable inside thesourcedbdatabase.Note The file spacing and alignment must be very precise. If the alignment of the keys is not correct, Replibyte cannot parse the file. In the following example, the keysdatabaseandtablemust align. It is possible to specify multiple tables using this formatting.- File: conf.yaml
1 2 3 4 5source: ... transformers: - database: sourcedb table: patients
After this section, add the
columnsinformation. The value ofcolumnsis an array of columns along with the transformation to apply to each column. The name of each column is indicated by thenamekey, while thetransformer_namekey specifies the transformer to use. Thetransformer_namemust reference one of the eight transformers mentioned in the “What are the Replibyte Transformer Types?” section of this guide.In the following example, Replibyte should apply the following transformations to the
patientstable inside thesourcedbdatabase.- The
first-nametransformer is applied to thefirst_namecolumn to generate a fake yet realistic first name. - The
randomtransformer acts upon thelast_namecolumn to generate an anonymous last name. - The
phone-numbertransformer converts thephonecolumn to a new random phone number in the correct format. - The
emailtransformer works on theemailcolumn to generate a fake email address. - The
keep-first-chartransformer is used on theunitcolumn. It retains the first letter of the unit, dropping the rest of the unit name. - The
credit-cardalters thecreditfield, generating a series of digits in the correct credit card format. - The
redactedfield modifies thesocialnumcolumn. It obscures most of the digits using X’s. - The
useridfield is left unchanged. To ensure this column is not altered, thetransienttransformer is applied.
The following example demonstrates how to add the
columnslist to theconf.yamlfile, along with a list of column names and the applicable transformers. Note thecolumnskey must align withdatabaseandtable, while thenameandtransformer_namekeys must align inside each column.- File: conf.yaml
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22source: ... transformers: - database: sourcedb table: patients columns: - name: userid transformer_name: transient - name: first_name transformer_name: first-name - name: last_name transformer_name: random - name: phone transformer_name: phone-number - name: email transformer_name: email - name: unit transformer_name: keep-first-char - name: credit transformer_name: credit-card - name: socialnum transformer_name: redacted
- The
Add a section describing the
datastore. This is where Replibyte stores the transformed SQL dump for later use. Begin this section with the keyworddatastore. To instruct Replibyte to store the data locally, add thelocal_diskkey. The value oflocal_diskis another key-value pair. Thedirkey indicates the name of the local directory. This directory must already exist before Replibyte is used. Within the YAML file, thedatastorekeyword must align with thesourcekeyword.Note For a full explanation of the configuration required to store the datastore in a cloud computing solution, see the Replibyte Datastore documentation.- File: conf.yaml
1 2 3 4 5source: ... datastore: local_disk: dir: /home/username/replibyte/data
To verify the syntax is correct, run the following command. It should list all available transformers for the selected source and should not display any errors.
replibyte -c conf.yaml transformer listThe entire
conf.yamlfile should appear similar to the following sample file.- File: conf.yaml
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25source: connection_uri: mysql://userid:password@127.0.0.1:3306/sourcedb transformers: - database: sourcedb table: patients columns: - name: userid transformer_name: transient - name: first_name transformer_name: first-name - name: last_name transformer_name: random - name: phone transformer_name: phone-number - name: email transformer_name: email - name: unit transformer_name: keep-first-char - name: credit transformer_name: credit-card - name: socialnum transformer_name: redacted datastore: local_disk: dir: /home/username/replibyte/data
How to Use Replibyte to Create and Restore an Anonymous Database Dump
To create a database dump, the conf.yaml file must already exist. The source and datastore components of the file must be fully defined. Additionally, the local storage directory or cloud computing location must already exist. The destination database must be of the same type as the source database. For instance, if Replibyte extracted and transformed data from a MySQL database, the destination database must also be a MySQL/MariaDB database.
To create and then restore a transformed database, follow these steps.
dump.sql file and then import the sql file into a database.To create the transformed data and save it to the datastore, use the Replibyte
dump createcommand. Replibyte displays a progress bar and indicates when it has finished.Note This guide creates the datastore from the source database. This is the most usual approach. It is also possible to base the dump database on the output ofmysqldumpor another similar command. For more information on this approach, see the Replibyte documentation.replibyte -c conf.yaml dump create[00:00:00] [#####################################################################] 2.76KiB/2.76KiB (0s)Confirm the
dumpandmetadata.jsonfiles now exist inside thedatastoretarget directory.ls replibyte/data/dump-1677767483822 metadata.jsonTo restore a modified database, add the
destinationinformation to theconf.yamlfile. Add theconnection_urikey and value for the destination database the same way thesourceconfiguration is specified. Theconnection_urifor the destination also uses the formatmysql://<user>:<password>@<host>:<port>/<database>. The following example writes the transformed data to theseedMariaDB database on another system. Replaceremote_ip_addrwith the IP address of the remote system. Replaceuseridandpasswordwith the account details for the remote user.- File: conf.yaml
1 2 3 4source: ... destination: connection_uri: mysql://userid:password@remote_ip_addr:3306/seed
Before restoring the database contents to the destination database, use the
dump listcommand to see all versions of the file inside the datastore.replibyte -c conf.yaml dump listname | size | when | compressed | encrypted --------------------+-----------+----------------+------------+----------- dump-1677767483822 | 982 Bytes | 28 minutes ago | true | falseTo write the contents of the datastore to a remote database, use the
dump restorecommand and theremoteoption. To restore based on the latest version in the datastore, use-v latest.Note Replibyte cannot write to a local database unless it is running inside a Docker container. To write the SQL dump to a local file, use the commandreplibyte -c conf.yaml dump restore local -v latest -o > dump.sql.replibyte -c conf.yaml dump restore remote -v latestRestore successful!To confirm the sensitive data in the destination database is appropriately hidden, access the remote system. Launch the database, and dump the description and contents of the
patientstable. The table definition should be the same, but the contents should be transformed into unrecognizable data. As expected, theuseridfield remains unchanged.SELECT * FROM patients;+----------+------------+-----------+------------+---------------------+------+------------------+-----------+ userid | first_name | last_name | phone | email | unit | credit | socialnum | ----------+------------+-----------+------------+---------------------+------+------------------+------------+ | 13572468 | Stephany | 9sxCe | (761) 618- | alaina@example.net | C | 4574727772422 | 222****** | | 13572469 | Ola | sREkQ | 124-880-12 | sincere@example.org | N | 4233777453337 | 343****** | | 13572470 | Sister | nNo | 1-590-599- | modesto@example.net | T | 5185478832614352 | 454****** | +----------+------------+-----------+------------+---------------------+------+------------------+-----------+To delete a dump from the datastore, use the
dump deletecommand and the name of the dump. It is also possible to keep only the ten most recent files using the commandreplibyte -c conf.yaml dump delete --keep-last=10.replibyte -c conf.yaml dump delete [dumpid]Dump deleted!
Conclusion
Replibyte transforms the data in a production database to help protect sensitive data. Several databases are supported, including MySQL/MariaDB, PostgreSQL, and MongoDB. Replibyte can create random strings, redact information, and create valid fake names, emails, and credit card numbers. To install Replibyte, download the Replibyte archive from GitHub. Then create a YAML file indicating the source database and the transformations to perform. Replibyte transfers the transformed database dump to a datastore, which can be stored locally or in the cloud. Replibyte can later transfer the data from the datastore to a destination database. For more information on Replibyte, see the Official Replibyte documentation.
More Information
You may wish to consult the following resources for additional information on this topic. While these are provided in the hope that they will be useful, please note that we cannot vouch for the accuracy or timeliness of externally hosted materials.
This page was originally published on