R pre-processing

This page describes how to pre-process different types of data for use with pret using the workerbee R package.

Available generators
Timepoints and treatment regimens
Sequencing data from Personalis
Flow and CyTOF data
1. Gated data
2. Clustered data
Mapping gene symbols

Available generators

The following generators are currently available (refer to the R documentation for detailed information):

Function	Purpose
`generate_regimens_therapies`	generate treatment regimens and therapies starting from a table of subjects information
`generate_timepoints`	combine timepoint and treatment regimens information to generate all the individual timepoint / treatment regimen combinations
`generate_cell_population_measurements`	generate cell population measurements starting from a table of event counts downloaded from CellEngine
`generate_cell_cluster_measurements`	generate cluster measurements starting from the output of the `grappolo` package
`generate_clinical_observations`	generate clinical observations by combining clinical data with timepoint information
`generate_personalis_data`	generate multiple types of data from Personalis output

Timepoints and treatment regimens

This example assumes a simple scenario in which there are several treatment regimens, all with the same schedule of events.

Assume you have a table of subjects that indicates the arm (i.e. treatment regimen) of each subject (the TRTACD column in the example below):

USUBJID	SEX	TRTACD
840-100100-001	M	B1
840-100100-002	F	B2
840-100100-003	F	B2
840-100100-004	M	C1

You can then create treatment regimens and therapies as follows:

subjects <- read.table("original/subjects.txt", header = T, stringsAsFactors = F)

reg.therapies <- workerbee::generate_regimens_therapies(subjects,
    subject.var = "USUBJID",
    treatment.var = "TRTACD"
)

The reg.therapies object is a list with two elements:

therapies contains the therapy information for each subject (i.e. the assignment of each subject to a treatment arm)
regimens contains the regimen ids

Now create a schedule of events file similar to the following

CYCLE	DAY	NOMINAL	ORDER	TYPE
1	1	C1D1	1	:timepoint.type/on-treatment
1	3	C1D3	2	:timepoint.type/on-treatment
1	4	C1D4	3	:timepoint.type/on-treatment
1	8	C1D8	4	:timepoint.type/on-treatment

You can then combine the schedule of events with the regimens information as follows

schedule.of.events <- read.table("original/schedule_of_events.txt", header = T, stringsAsFactors = F) 
timepoints <- workerbee::generate_timepoints(schedule.of.events,
    regimens = reg.therapies$regimens$id,
    timepoint.name.var = "NOMINAL"    
)

This will generate timepoints by duplicating the schedule of events for each treatment regimen, as follows:

regimen	CYCLE	DAY	ORDER	TYPE	id
B1	1	1	1	:timepoint.type/on-treatment	B1/C1D1
B2	1	1	1	:timepoint.type/on-treatment	B2/C1D1
C1	1	1	1	:timepoint.type/on-treatment	C1/C1D1
B1	1	3	2	:timepoint.type/on-treatment	B1/C1D3
B2	1	3	2	:timepoint.type/on-treatment	B2/C1D3
C1	1	3	2	:timepoint.type/on-treatment	C1/C1D3
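
To make the duplication step concrete, the following base-R sketch builds the same kind of cross product on inline data. It only illustrates the expected shape of the output and is not workerbee's actual implementation; the two-row schedule is a shortened stand-in for the file shown above.

```r
# Shortened stand-in for the schedule of events shown above
schedule <- data.frame(
    CYCLE = c(1, 1),
    DAY = c(1, 3),
    NOMINAL = c("C1D1", "C1D3"),
    ORDER = c(1, 2),
    TYPE = ":timepoint.type/on-treatment",
    stringsAsFactors = FALSE
)
regimens <- data.frame(regimen = c("B1", "B2", "C1"), stringsAsFactors = FALSE)

# merge() with by = NULL returns the Cartesian product of the two tables,
# i.e. one copy of the schedule per regimen
timepoints <- merge(regimens, schedule, by = NULL)

# Build the id as <regimen>/<nominal timepoint name>
timepoints$id <- paste(timepoints$regimen, timepoints$NOMINAL, sep = "/")

print(timepoints)
```

With three regimens and two scheduled events, the result has six rows, one per regimen / timepoint combination.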

Now save everything as follows

write.table(timepoints, "processed/timepoints.txt", sep = "\t", col.names = T, row.names = F, quote = F)
write.table(reg.therapies$therapies, "processed/therapies.txt", sep = "\t", col.names = T, row.names = F, quote = F)
write.table(reg.therapies$regimens, "processed/treatment-regimens.txt", sep = "\t", col.names = T, row.names = F, quote = F)

Sequencing data from Personalis

The generate_personalis_data function is a high-level entry point that generates multiple types of data by calling a number of downstream generators. You can control which generators are called with the generators argument. Each generator expects files with specific names in the input.dir folder. Copy the files matching the following regular expressions from the Personalis output folder to input.dir:

generator	required files
`gene_expression`	`tumor_rna_expression_report`
`cna`	`somatic_dna_gene_cna_report`
`tmb`	`dna_statistics`
`variants`	`somatic_dna_small_variant_report_preferred`
`tcr_alpha`	`rna_tcr_alpha_clone_report`
`tcr_beta`	`rna_tcr_beta_clone_report`
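
Before calling the generator it can be useful to check which of these files are actually present. The sketch below does this with base R only. The patterns come from the table above, while the temporary folder and the two file names are made up for illustration.

```r
# Generator name -> file name pattern, taken from the table above
required <- c(
    gene_expression = "tumor_rna_expression_report",
    cna = "somatic_dna_gene_cna_report",
    tmb = "dna_statistics",
    variants = "somatic_dna_small_variant_report_preferred",
    tcr_alpha = "rna_tcr_alpha_clone_report",
    tcr_beta = "rna_tcr_beta_clone_report"
)

# Hypothetical input folder containing two made-up files (illustration only)
input.dir <- tempfile("personalis")
dir.create(input.dir)
invisible(file.create(file.path(input.dir, c(
    "SAMPLE1_tumor_rna_expression_report.tsv",
    "SAMPLE1_dna_statistics.tsv"
))))

# For each generator, check whether at least one matching file is present
has.files <- vapply(
    required,
    function(pattern) length(list.files(input.dir, pattern = pattern)) > 0,
    logical(1)
)

# Generators whose input files were found
usable.generators <- names(has.files)[has.files]
print(usable.generators)
```

The resulting vector could then be passed as the generators argument, so that only generators with available input files are requested.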

The Personalis data files contain the sample name in the file name. Most likely you will have to supply a mapping between the file names and the actual sample barcodes.

Assuming you have a table that contains a mapping between filenames and barcodes, such as a table downloaded from RawSugar after the matching process, you can create this mapping as follows:

name.to.sample <- read.table("filename_to_sample.txt", header = T, stringsAsFactors = F)

samples.map <- as.vector(name.to.sample$sample.barcode)
names(samples.map) <- name.to.sample$filename

Alternatively, instead of the full filename you can also specify a sample ID that will be extracted from the filename.
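
The filename-to-barcode mapping is a plain named character vector, which the following self-contained sketch builds from inline data. The filenames and barcodes are hypothetical; only the column names filename and sample.barcode come from the example above.

```r
# Hypothetical mapping table; in practice this is read from filename_to_sample.txt
name.to.sample <- data.frame(
    filename = c("fileA.tsv", "fileB.tsv"),
    sample.barcode = c("BARCODE-001", "BARCODE-002"),
    stringsAsFactors = FALSE
)

# Each filename should appear only once, otherwise the lookup is ambiguous
stopifnot(anyDuplicated(name.to.sample$filename) == 0)

# Named character vector: names are filenames, values are sample barcodes
samples.map <- setNames(name.to.sample$sample.barcode, name.to.sample$filename)

# Look up the barcode for a single file
print(samples.map[["fileA.tsv"]])
```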

You can then use the generate_personalis_data function as follows

wick::set_dbname("your-db-name")
all.genes <- wick::get_all_genes()

workerbee::generate_personalis_data(
    input.dir = "original/personalis",
    output.dir = "processed/personalis",
    assembly = "GRCh38",
    all.genes = all.genes,
    samples.map = samples.map,
    generators = c("cna", "variants", "gene_expression", "tmb", "tcr_alpha", "tcr_beta")
)

This function will process your data, save it in the folder specified in the output.dir argument, and also output snippets that you can include in your config.edn file.

Flow and CyTOF data

For flow and CyTOF data, the default values of the column name arguments in the generator functions described below work for data that has been downloaded from CellEngine or clustered using grappolo. For data from other sources you will have to provide these arguments yourself (refer to the R documentation).

Gated data

For gated data you will have to create a table with cell population names, and how they map to standardized cell ontology names, similar to the following (see the Schema section for more details on representing cell populations in CANDEL).

name	cell.type	positive.epitopes
Plasmablast	plasmablast	NA
NK cells	natural killer cell	NA
CD4 T Cells	CD4-positive, alpha-beta T cell	NA
CD4 T Cells > CD25+	CD4-positive, alpha-beta T cell	IL2RA

Remember that when specifying cell populations in your config.edn file you will also need to indicate that they do not come from clustering, like in the following example:

:cell-populations [{:pret/input-file "processed/cell_populations.txt"
                    :pret/na "NA"
                    :name "name"
                    :positive-markers "positive.epitopes"
                    :cell-type "cell.type"
                    :pret/constants {:from-clustering false}}]

Cell populations abundance from gated data

Assuming you have a table of event counts for each sample and population, you can use the generate_cell_population_abundance_measurements function. You will have to specify which population to use for normalizing the event counts (e.g. CD45+ in the following example). Note that the function also outputs a snippet that you can include in your config.edn file.

tab <- read.table("original/event_count.txt", header = T, sep = "\t", check.names = F)
# This is optional, to limit the results to a pre-specified set of cell populations
cell.populations.mapping <- read.table("original/cell_populations_mapping.txt", header = T, 
                                        sep = "\t", check.names = F)

workerbee::generate_cell_population_abundance_measurements(
    tab,
    norm.population = "CD45+",
    out.file = "processed/event_count.txt",
    all.cell.populations = cell.populations.mapping$name
)
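
As a rough illustration of what normalizing to a reference population means, the base-R sketch below divides each event count by the CD45+ count of the same sample. The long table layout (sample, population, count) and the numbers are assumptions made for this example. The actual CellEngine export and the workerbee internals may differ.

```r
# Hypothetical event counts in long format (one row per sample and population)
counts <- data.frame(
    sample = rep(c("S1", "S2"), each = 3),
    population = rep(c("CD45+", "CD4 T Cells", "NK cells"), times = 2),
    count = c(1000, 250, 100, 2000, 400, 300),
    stringsAsFactors = FALSE
)

# Extract the count of the normalization population for each sample
norm <- counts[counts$population == "CD45+", c("sample", "count")]
names(norm)[names(norm) == "count"] <- "norm.count"

# Attach the per-sample normalization count and compute the fraction
res <- merge(counts, norm, by = "sample")
res$fraction <- res$count / res$norm.count

print(res)
```

Each fraction is in [0, 1] as long as the population is a subset of the normalization population.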

Cell populations abundance as percent of parent

To import percent of parent data, start from a table that already contains percent of parent values (as opposed to event counts) and skip the normalization, as follows (refer to the documentation for more details):

tab <- read.table("original/percent_parent.txt", header = T, sep = "\t", check.names = F)
# This is optional, to limit the results to a pre-specified set of cell populations
cell.populations.mapping <- read.table("original/cell_populations_mapping.txt", header = T, 
                                        sep = "\t", check.names = F)

workerbee::generate_cell_population_abundance_measurements(
    tab,
    out.file = "processed/percent_parent.txt",
    measurement.type = "percent-of-parent",
    divide.by.100 = TRUE, # Percentages in CANDEL are represented in [0, 1]
    all.cell.populations = cell.populations.mapping$name
)

Marker expression measurements from gated data

You will have to provide a table mapping reagent names in the data to epitopes in CANDEL. The easiest way to do this is by creating a TSV file with this information, similar to the following

reagent	epitope
113In_CD40	TNR5
115In_CD20	CD20
141Pr_CD3	CD3E
142Nd_CD19	CD19
143Nd_CD117_c-kit	KIT

and pass it as input to the generator as exemplified below

tab <- read.table("original/marker_expression.txt", header = T, sep = "\t", check.names = F)
reagents.mapping <- read.table("original/reagents_mapping.txt", header = T, sep = "\t", check.names = F)

# This is optional, to limit the results to a pre-specified set of cell populations
cell.populations.mapping <- read.table("original/cell_populations_mapping.txt", header = T,
                                        sep = "\t", check.names = F)

# This is also optional, to limit the results to epitopes that are present in the CANDEL database
wick::set_dbname("YOUR-DATABASE")
all.epitopes <- wick::get_all_epitopes()

workerbee::generate_cell_population_epitope_measurements(
    tab, 
    reagents.mapping = reagents.mapping, 
    all.epitopes = all.epitopes, 
    all.cell.populations = cell.populations.mapping$name,
    out.file = "processed/marker_expression.txt"
)
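
It can help to check the reagent mapping against the known epitopes before running the generator. The sketch below uses rows from the mapping example above together with a made-up epitope list. It assumes that the epitopes can be handled as a character vector, which is not confirmed here for the output of wick::get_all_epitopes.

```r
# Reagent-to-epitope table, using rows from the example above
reagents.mapping <- read.table(
    text = "reagent\tepitope\n113In_CD40\tTNR5\n115In_CD20\tCD20\n141Pr_CD3\tCD3E",
    header = TRUE, sep = "\t", stringsAsFactors = FALSE
)

# Hypothetical stand-in for the epitopes present in the database
all.epitopes <- c("CD20", "CD3E", "CD19")

# Epitopes named in the mapping that are not in the known list
missing.epitopes <- setdiff(reagents.mapping$epitope, all.epitopes)
print(missing.epitopes)
```

Any epitope reported here is worth reviewing in the mapping file, since it may be a typo or a name that the database does not contain.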

Clustered data

For clustered data, both cell abundance and marker expression measurements are generated with a single function call. As above, it is recommended to provide a table that maps reagents to epitopes in the database:

tab <- read.table("original/data.clustered.txt", header = T, sep = "\t", check.names = F)
reagents.mapping <- read.table("original/cytof_reagents_mapping.txt", header = T, sep = "\t")

# This is optional, to limit the results to epitopes that are present in the CANDEL database
wick::set_dbname("YOUR-DATABASE")
all.epitopes <- wick::get_all_epitopes()

workerbee::generate_cell_cluster_measurements(
    tab, 
    reagents.mapping = reagents.mapping, 
    all.epitopes = all.epitopes,
    out.file = "processed/clusters_data.txt"
)

Mapping gene symbols

If your data contains gene symbols, wick provides a function to map strings to HGNC symbols. You can perform the mapping as follows:

library(wick)

set_dbname("YOUR-DATABASE")

all.genes <- get_all_genes()

mapping.res <- map_gene_symbols(all.genes, vector.of.symbols.to.map)

Note that the function returns NA for symbols that cannot be mapped.
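
Because unmapped symbols come back as NA, it is worth listing them before continuing. The sketch below assumes that the result is a vector aligned with the input symbols, which is an assumption about map_gene_symbols rather than documented behavior. Both vectors are made-up stand-ins.

```r
# Hypothetical input symbols
vector.of.symbols.to.map <- c("CD3E", "not-a-gene", "KIT")

# Made-up stand-in for the output of map_gene_symbols (assumed aligned to the input)
mapping.res <- c("CD3E", NA, "KIT")

# Input symbols that could not be mapped
unmapped <- vector.of.symbols.to.map[is.na(mapping.res)]

# Successfully mapped symbols
mapped <- mapping.res[!is.na(mapping.res)]

print(unmapped)
```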