pret
pret, which stands for “Programmable ETL” (and also means “ready” in French), is the tool that lets you map arbitrary datasets onto the CANDEL schema, request a new database, and load data into it.
Please use the sidebar to navigate to instructions about using the pret CLI as well as detailed documentation on the directives that specify how data is processed by pret.
If you prefer to learn from real-world, practical examples, see Importing data.
Environment setup and installation
Prerequisites
pret requires the following to be installed:
- Java 11. pret is likely also compatible with Java 8 (1.8) or Java 9, but this is not guaranteed. See OpenJDK for installation instructions.
- The Google Cloud SDK
Configure your access credentials with the Google Cloud CLI by running:
gcloud auth application-default login
at a terminal and following the login procedure.
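Before installing, it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch, not part of pret itself: the helper name require_cmd is hypothetical, and it assumes the Google Cloud SDK exposes the gcloud and gsutil commands in your shell.

```shell
# Report whether a command is available on PATH.
# Returns non-zero (and prints to stderr) if the command is missing.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the tools named in the prerequisites above.
for tool in java gcloud gsutil; do
  require_cmd "$tool" || true
done

# Print the installed Java version so it can be compared with the supported one.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
fi
```

Any line starting with "missing:" points to a prerequisite that still needs to be installed.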
Installing pret
You can download a release of pret by running the following command at the terminal:
gsutil cp gs://pret-releases/pret-$VERSION.zip .
where $VERSION is the version of pret you want. To see all releases, use gsutil ls gs://pret-releases. Normally you will want the latest version. Note that version numbers are not zero-padded, so pret-1.0.9 may be listed after pret-1.0.35, even though pret-1.0.35 is the more recent version.
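Because a plain alphabetical listing puts pret-1.0.9 after pret-1.0.35, one way to find the newest release is to sort each version component numerically. The sketch below assumes release objects are named pret-X.Y.Z.zip with three numeric components; the function name latest_pret_version is hypothetical, and the listing in the example is hard-coded for illustration.

```shell
# Read a release listing on stdin (one object URL per line, as printed by
# `gsutil ls gs://pret-releases`) and print the highest X.Y.Z version.
# Assumption: objects are named pret-X.Y.Z.zip; other lines are ignored.
latest_pret_version() {
  sed -n 's/.*pret-\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\)\.zip$/\1/p' |
    sort -t. -k1,1n -k2,2n -k3,3n |
    tail -n 1
}

# Example with a hard-coded listing; prints 1.0.35
printf '%s\n' \
  'gs://pret-releases/pret-1.0.35.zip' \
  'gs://pret-releases/pret-1.0.9.zip' | latest_pret_version

# With bucket access, the result could be captured for the download step:
#   VERSION=$(gsutil ls gs://pret-releases | latest_pret_version)
```

Sorting each component as a number avoids the problem described above, where 9 compares as greater than 35 when treated as text.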
After you’ve downloaded the latest version of pret, unzip the archive:
unzip pret-$VERSION.zip
Then change into the directory you just created:
cd pret-$VERSION
Running
Invoke pret from inside the unzipped directory with:
./pret
on Linux or macOS, or
pretw.bat
on Windows.
This will echo the command-line usage options (for details, see below). For instance, to execute the prepare task, you would run:
./pret prepare --import-config /path/to/config.edn --working-directory /path/to/working-dir/
pret CLI usage and options
The pret command line utility (CLI) is run as follows (using the macOS/Linux script; for Windows usage, see above):
./pret <task> <options>
Run pret with no arguments, i.e. ./pret, for a full list of commands and arguments.
pret can be run to execute the following tasks:
Task | Description |
---|---|
login | Confirms identity via Google and issues an authentication token. Requires the --email argument. Must be run before the request-db, transact, and validate tasks. |
request-db | Creates a new branch database that is a copy of the master CANDEL database. User must be logged in. Use this task when you begin an import, or after a failed partial transaction, to start with a clean database. Returns the name of the branch database. This request takes a few minutes to fulfill and will notify you by email when complete. |
request-empty-db | Creates a new database with bootstrapped reference data only (does not include other datasets). This can be helpful for getting an initial import working, but it provides no information about whether, for example, reference data in your import will conflict or overlap with data in the existing master database. |
prepare | Takes your import config and input data files and prepares them for transacting into the database. Requires the import config and working directory options (see below). This task also performs numerous validations on your config and data files and will emit an error if your files don’t pass these validations. Note that this task can be run repeatedly to iterate on correct formatting of config and import data - just be sure to empty out your working directory if you have some processed files in it. pret will emit an error if you don’t do that. |
transact | Takes your prepared data and transacts it into a database. Requires the --database argument, with the name previously used for provisioning a branch or empty database, as well as the --working-directory argument. You’ll want to run this task against the database you spun up using the request-db task. You can also transact data created with diff by using the update arg. |
diff | The diff workflow can be used to stage a subset of prepared data for transaction. See the documentation for the diff workflow below. |
validate | After transacting your data successfully, this task will perform additional validations on the data which will determine whether e.g. measurements have all necessary attributes, or that references in the dataset refer to other entities that have been successfully transacted. |
The options for these tasks are as follows:
Option | Description |
---|---|
--email USER-EMAIL | User email, used for login. Must belong to a Google-backed domain (e.g. gmail, parkerici.org, or another allowed Google organization). |
--import-config IMPORT-CONFIG | Import Config edn file (required for prepare) |
--working-directory IMPORT-WORKING-DIRECTORY | Directory where prepared data goes; transact uses the data that prepare puts here (required for prepare and transact) |
--tx-batch-size TX-BLOCK-SIZE | Datomic transaction batch size (defaults to 50). Most of the time you will not need to set this. |
-h, --help | Help function. |
Example usage
The following example logs in and then requests a branch database named candel-db-123. The next commands prepare the data specified in ~/repos/pret-datasets/tcga/config.edn into the working directory ~/data/tcga-import/tmp-working, transact the data into the database created by the request-db task, and then validate the transacted data.
./pret login --email bestuser@parkerici.org
./pret request-db --database candel-db-123
OUTPUT: Request successful, created database: candel-db-123
./pret prepare --import-config ~/repos/pret-datasets/tcga/config.edn --working-directory ~/data/tcga-import/tmp-working
./pret transact --database candel-db-123 --working-directory ~/data/tcga-import/tmp-working
./pret validate --database candel-db-123 --working-directory ~/data/tcga-import/tmp-working
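When iterating on an import, the prepare, transact, and validate steps above can be chained so that a failure stops the sequence. The following is a sketch under stated assumptions rather than an official script: the function name run_import, its argument order, and the PRET variable are hypothetical. It assumes login and request-db have already completed (request-db takes a few minutes and notifies you by email), and it refuses to run if the working directory is not empty, since prepare emits an error in that case.

```shell
# Path to the pret launcher; override by setting PRET (hypothetical variable).
PRET="${PRET:-./pret}"

# run_import CONFIG WORKDIR DATABASE
# Runs prepare, transact, and validate in order, stopping at the first failure.
run_import() {
  if [ "$#" -ne 3 ]; then
    echo "usage: run_import CONFIG WORKDIR DATABASE" >&2
    return 2
  fi
  config="$1"; workdir="$2"; db="$3"

  # prepare expects an empty working directory, so create it if needed
  # and stop early if earlier output is still present.
  mkdir -p "$workdir" || return 1
  if [ -n "$(ls -A "$workdir")" ]; then
    echo "working directory is not empty: $workdir" >&2
    return 1
  fi

  "$PRET" prepare --import-config "$config" --working-directory "$workdir" || return 1
  "$PRET" transact --database "$db" --working-directory "$workdir" || return 1
  "$PRET" validate --database "$db" --working-directory "$workdir" || return 1
}

# Example call, using the paths and database name from the example above:
#   run_import ~/repos/pret-datasets/tcga/config.edn ~/data/tcga-import/tmp-working candel-db-123
```

Stopping at the first failing step avoids transacting partially prepared data; after a failed partial transaction, the task table above recommends requesting a fresh database with request-db.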
Diff merge documentation
See the full diff workflow documentation here