How-to guides
Migrating your plaintext data to be fully encrypted
This how to guide explains how to migrate your plaintext data to be fully encrypted with CipherStash. It also explains the different encryption modes, and how they work. Finally, it explains how to clean up your plaintext data.
Introduction
You have plaintext data you want to protect. CipherStash helps you encrypt your data in your existing database, while keeping it searchable.
But what are the steps you need to follow to go from plaintext to fully encrypted? CipherStash provides a way to do this without any application downtime, while keeping the encrypted and plaintext field values consistent.
CipherStash uses encryption modes to incrementally migrate your plaintext to ciphertext
Say you have a patients
table in your application that has an email
column you want to protect. All the values in that email
column are currently in plaintext. To encrypt that email
column, you need to create two additional encrypted columns:
- A "source" column, which contains the encrypted value from an AES-GCM encryption operation
- An "index" column, which contains an encrypted index that can be used to search and order the encrypted value
CipherStash Proxy can operate in four modes:
plaintext
— the plaintext column is used for all CRUD operations.plaintext-duplicate
— the plaintext column is used for all CRUD operations, butINSERTs
andUPDATEs
are also written to the encrypted columns.encrypted-duplicate
— the encrypted columns are used for all CRUD operations, butINSERTs
andUPDATEs
are also written to the plaintext column.encrypted
— the encrypted columns are used for all CRUD operations.
While it's possible to do a knife-switch cutover from plaintext
to encrypted
modes, we recommend CipherStash users migrate their data by stepping through the modes incrementally. This builds confidence in the performance and availability of the encryption, and gives you opportunity to roll back.
This also allows for zero-downtime migrations.
Understanding the different encryption modes
When integrating CipherStash into your application, you create a dataset config which describes which fields you want to encrypt, what encrypted indexes to enable, and what encryption mode
to use for each field.
The following is an example of a dataset config with a single field configured:
1# dataset.yml
2tables:
3 - path: patients
4 fields:
5 - name: email
6 in_place: false
7 cast_type: utf8-str
8 mode: plaintext-duplicate
9 indexes:
10 - version: 1
11 kind: match
12 tokenizer:
13 kind: ngram
14 token_length: 3
15 token_filters:
16 - kind: downcase
17 k: 6
18 m: 2048
19 include_original: true
20 - version: 1
21 kind: ore
22 - version: 1
23 kind: unique
From the config, the field email
of the type utf8-str
is configured with match
, ore
and unique
indexes. This field should be represented by the following database columns:
column | description |
---|---|
plaintext source value for email | |
__email_encrypted | encrypted source value for email |
__email_match | encrypted match index for email |
__email_ore | encrypted ore index for email |
__email_unique | encrypted unique index for email |
The encryption mode
for this field is set to plaintext-duplicate
.
Each field's encryption mode
can have one of the following values:
plaintext-duplicate
encrypted-duplicate
encrypted
CipherStash Proxy is responsible for database reads, writes, query mapping, encryption, and decryption. Each encryption mode affects how Proxy behaves.
Encryption Mode: plaintext-duplicate
column | description | operations |
---|---|---|
plaintext source value for email | reads, writes, lookups | |
__email_encrypted | encrypted source value for email | writes |
__email_match | encrypted match index for email | writes |
__email_ore | encrypted ore index for email | writes |
__email_unique | encrypted unique index for email | writes |
In plaintext-duplicate
mode, CipherStash Proxy uses the plaintext source column for all read operations. However, Proxy writes to the plaintext source, encrypted source, and the encrypted index columns.
Figure: Mapping of SELECT statements and WHERE clauses for a field in plaintext-duplicate mode
CipherStash Proxy handles SELECT
statements and WHERE
clauses as follows:
- When selecting for the field's value, Proxy retrieves the plaintext value from the plaintext column directly.
- When the field is used as a filter in a
WHERE
clause, Proxy will pass that through to the database. - In
plaintext-duplicate
mode, the Proxy does not use the encrypted source and index columns for read operations.
This works in the same way as plain SQL without CipherStash Proxy.
Figure: Mapping of INSERT and UPDATE statements for a field in
plaintext-duplicate
mode
CipherStash Proxy handles DB INSERT
and UPDATE
statements as follows. When inserting a new row, or updating the field, Proxy will encrypt the field's value and encrypted index. Proxy will then write the plaintext value, the encrypted value, and the encrypted index values into each column respectively.
This encryption mode is used when first setting up CipherStash on an existing plaintext field. When initially adding the encrypted source column, and the encrypted indexes to the database their values have not been populated yet. This mode allows the application to keep working with the plaintext values during reads, and incrementally write the encrypted values, while also allowing you to bulk encrypt every record in the database in the background without causing application downtime.
Encryption Mode: encrypted-duplicate
In the following example, the encryption mode on the dob
field is set to encrypted-duplicate
1# dataset.yml
2tables:
3 - path: patients
4 fields:
5 - name: email
6 in_place: false
7 cast_type: utf8-str
8 mode: encrypted-duplicate
9 indexes:
10 - version: 1
11 kind: match
12 tokenizer:
13 kind: ngram
14 token_length: 3
15 token_filters:
16 - kind: downcase
17 k: 6
18 m: 2048
19 include_original: true
20 - version: 1
21 kind: ore
22 - version: 1
23 kind: unique
column | description | operations |
---|---|---|
plaintext source value for email | writes | |
__email_encrypted | encrypted source value for email | reads, writes |
__email_match | encrypted match index for email | writes, match lookups (LIKE ) |
__email_ore | encrypted ore index for email | writes, ORE lookups (< , <= , > , >= ) |
__email_unique | encrypted unique index for email | writes, exact lookups (= ) |
In encrypted-duplicate
mode, CipherStash Proxy:
- Uses the encrypted source column for retrieving the field value.
- Uses the encrypted index columns for lookups by this field.
- Writes to the plaintext source column, encrypted source column, and all index columns, in the same way as in the
plaintext-duplicate
mode.
Figure: Mapping of
SELECT
statements andWHERE
clauses for a field inencrypted-duplicate
mode
CipherStash Proxy handles SELECT
statements and WHERE
clauses as follows:
- When selecting for the field's value, Proxy retrieves the ciphertext from the encrypted source column, and decrypts it.
- When the field is used as a filter in a
WHERE
clause, Proxy will encrypt the filter term, and map the filter to use the appropriate encrypted index column for the filter comparison.
To show how this works in practice, per the diagram above:
- The clause
email LIKE 'user%'
is mapped to__email_match @> xxx
, wherexxx
is the encrypted match term. - The clause
email >= 'user@example.com'
is mapped to__email_ore > xxx
, wherexxx
is the encrypted ore term. - The clause
email = 'user@example.com'
is mapped to__email_unique = xxx
, wherexxx
is the encrypted exact term.
Because this relies on the encrypted column and the encrypted index columns, the query will not work correctly if the table contains rows that have not been encrypted. This applies to rows where the encrypted source column and index columns are NULL
. Such rows might incorrectly appear with NULL
values, or might not appear in queries that filter based on the encrypted index.
Additionally, queries can fail if the required encrypted index columns are not configured. In the example above, supposed that the match index was left out, if you try to do a LIKE
query on this field, CipherStash Proxy will fail to map the query because a match
index is required.
Figure: Mapping of
INSERT
andUPDATE
statements for a field inencrypted-duplicate
mode
In encrypted-duplicate
mode, CipherStash Proxy handles DB INSERT
and UPDATE
statements similar to plaintext-duplicate
mode. When inserting a new row, or updating the field, Proxy will encrypt the field's value and encrypted index. Proxy will then insert the plaintext value, the encrypted value, and the encrypted index values into each column respectively.
The encrypted-duplicate
mode is used when all records in the table have been fully encrypted, and there are no more encrypted source column or index columns with NULL
values. This mode makes the application operate fully on encrypted reads and writes, while still keeping the value in the plaintext column consistent and updated. This allows for the application to be tested extensively with fully encrypted read-writes. If there is a bug or broken functionality discovered at this stage, the application can easily be reverted to the plaintext-duplicate
mode.
Encryption mode: encrypted
In the following example, the field dob
encryption mode is set to encrypted
:
1# dataset.yml
2tables:
3 - path: patients
4 fields:
5 - name: email
6 in_place: false
7 cast_type: utf8-str
8 mode: encrypted
9 indexes:
10 - version: 1
11 kind: match
12 tokenizer:
13 kind: ngram
14 token_length: 3
15 token_filters:
16 - kind: downcase
17 k: 6
18 m: 2048
19 include_original: true
20 - version: 1
21 kind: ore
22 - version: 1
23 kind: unique
column | description | operations |
---|---|---|
__email_encrypted | encrypted source value for email | reads, writes |
__email_match | encrypted match index for email | writes, match lookups (LIKE ) |
__email_ore | encrypted ore index for email | writes, ore lookups (< , <= , > , >= ) |
__email_unique | encrypted unique index for email | writes, exact lookups (= ) |
In encrypted
mode, CipherStash Proxy handles reads and writes similar to encrypted-duplicate
mode:
- Proxy uses the encrypted source column for retrieving the field value.
- Proxy uses the encrypted index columns for query filters.
- Proxy writes to the encrypted source and the encrypted index columns.
The only difference between the encrypted
and encrypted-duplicate
mode is that in the encrypted
mode, CipherStash Proxy no longer writes to the plaintext column.
Figure: Mapping of
SELECT
statements andWHERE
clauses for a field inencrypted
mode
CipherStash Proxy handles SELECT
statements and WHERE
clauses similar to encrypted-duplicate
mode. When selecting for the field's value, Proxy retrieves the ciphertext from the encrypted source column, and decrypts it. When the field is used as a filter in a WHERE
clause, Proxy will encrypt the filter term, and map the filter to use the appropriate encrypted index column for the filter comparison.
Figure: Mapping of
INSERT
andUPDATE
statements for a field inencrypted
mode
In this mode, CipherStash Proxy handles database INSERT
and UPDATE
statements similar to plaintext-duplicate
and encrypted-duplicate
modes. However, Proxy no longer writes into the plaintext column.
When inserting a new row, or updating the field, CipherStash Proxy will encrypt the field's value and encrypted index. Proxy will then write the encrypted value and the encrypted index values into each column respectively.
This mode is used when the application has been verified to work correctly with CipherStash in encrypted-duplicate
mode. When encrypted
mode is used, the plaintext column is no longer used, and can be safely removed from the table.
Migrating a field from plaintext-duplicate
mode to encrypted
Prerequisites
Before starting, please ensure that you:
- Have already created a CipherStash account (you can do this with the
stash signup
command) - Have already installed Stash CLI in your local development environment
- Are connected to the Internet
- Have an application that integrates with CipherStash. You can use our example application to get started.
- Have an existing plaintext field that has been configured in
plaintext-duplicate
mode.
For this example, let's call this field email
on the users
table.
Example: migrating users.email
from plaintext-duplicate
to encrypted
For this example, let's assume that you have a field email
on the table users
, that have been configured in plaintext-duplicate
mode. This field has the following configuration in your dataset config file.
1# dataset.yml
2tables:
3 # other tables and fields above here
4 - path: users
5 fields:
6 - name: email
7 in_place: false
8 cast_type: utf8-str
9 mode: plaintext-duplicate
10 indexes:
11 - version: 1
12 kind: match
13 tokenizer:
14 kind: ngram
15 token_length: 3
16 token_filters:
17 - kind: downcase
18 k: 6
19 m: 2048
20 include_original: true
21 - version: 1
22 kind: ore
23 - version: 1
24 kind: unique
We will also assume that the following columns have been created in your database:
email
__email_encrypted
__email_match
__email_ore
__email_unique
Encrypting all the records
In plaintext-duplicate
mode, your application will write to all plaintext, encrypted, and index columns, while only reading from the plaintext column. Because of this, the application will work without issues even when retrieving records that have not been encrypted yet.
However, the application will not function correctly if it is switched to one of the other modes (plaintext-duplicate
, encrypted-duplicate
, or encrypted
) at this stage. Before switching this field to either encrypted-duplicate
or encrypted
mode, we need to encrypt all records in the table.
The steps to take depends on which type of application you are working with.
Migrating to encrypted
mode
At this point, all the records on the table have been encrypted. You have also verified that your application still works when you switch the users.email
field to encrypted-duplicate
mode. Your application is now ready to switch this field to encrypted
mode.
As before, update the encryption mode in the dataset config file to encrypted
.
1# dataset.yml
2tables:
3 # snip
4 - path: users
5 fields:
6 - name: email
7 in_place: false
8 cast_type: utf8-str
9 mode: encrypted # updated
10# snip
Next, upload the new dataset config to CipherStash.
1stash datasets config upload --file dataset.yml --client-id $CS_CLIENT_ID --client-key $CS_CLIENT_KEY
Finally, restart your application.
Once your application restarts, it will no longer write any values to the plaintext column you started with. All reads and writes should only depend on the encrypted source column and the encrypted index columns.
Cleaning up the plaintext column
At this point, it should be safe to drop the plaintext column from the database. The steps will depend on how database migrations are handled in your application.