Expression Functions¶
SolveBio expressions can use Python-like functions to pull data from any dataset, calculate statistics, or run advanced algorithms.
Function List¶
All available functions are listed below:
| Function | Data Type | Description |
|---|---|---|
| annotate | object | Annotate a record with a template. `annotate(record, template, debug, include_errors)` Learn more → |
| beacon | object | Retrieves the beacon results for any entity. `beacon(entity, entity_type, beacon_set, datasets, visibility)` Learn more → |
| classify_variant | object | Classify a variant using one of multiple classifiers. `classify_variant(variant, classifier)` Learn more → |
| coerce_list | auto (list) | Coerce a value to a list: single items become a single-value list, lists remain lists, and None returns an empty list. `coerce_list(value)` Learn more → |
| concat | string | Combine text from multiple lists or strings. `concat(values, delimiter)` Learn more → |
| crossmap | string | Convert a variant or genomic region entity between genome builds using the Ensembl CrossMap tool (equivalent to UCSC's liftOver tool). `crossmap(entity, target_build)` Learn more → |
| dataset_count | integer | Calculate the total number of results (or "hits") for a given query. `dataset_count(dataset, entities, filters, query)` Learn more → |
| dataset_entity_top_terms | string (list) | Retrieve the top entities for any entity field in a dataset. Returns a list of strings in order of occurrence, or None if the dataset cannot be queried by this entity. `dataset_entity_top_terms(dataset, entity, limit, filters, query)` Learn more → |
| dataset_field_percentiles | object | Calculates the percentiles for any integer field. Returns an object containing the desired percentiles. `dataset_field_percentiles(dataset, field, percents, entities, filters, query)` Learn more → |
| dataset_field_stats | object | Calculates statistics for any numeric field. Returns an object containing field statistics. `dataset_field_stats(dataset, field, entities, filters, query)` Learn more → |
| dataset_field_terms_count | integer | Retrieve the number of unique terms for any string field in a dataset. `dataset_field_terms_count(dataset, field, entities, filters, query)` Learn more → |
| dataset_field_top_terms | object (list) | Retrieve the top terms for any string field in a dataset. Returns a list of objects containing each term and its occurrence count, in order of occurrence. `dataset_field_top_terms(dataset, field, limit, entities, filters, query)` Learn more → |
| dataset_field_values | auto (list) | Retrieves a list of non-empty values for a dataset field. `dataset_field_values(dataset, field, limit, entities, filters, query)` Learn more → |
| dataset_query | object (list) | Query any dataset with optional filters and/or entities. Returns a list of results. `dataset_query(dataset, fields, limit, entities, filters, query)` Learn more → |
| datetime_format | string | Format datetime strings. Returns an ISO 8601 datetime string by default; provide an optional input_format or output_format to override. `datetime_format(value, input_format, output_format)` Learn more → |
| entity_ids | string | Retrieve one or more normalized entity IDs for a query. `entity_ids(entity_type, entity)` Learn more → |
| error | error | Raise a FunctionError. `error(message)` Learn more → |
| explode | object (list) | Split N values from M list fields into N records. If _id is in the original record, each new record's _id gets the index of the exploded record appended. `explode(record, fields)` Learn more → |
| findall | string (list) | Returns all non-overlapping matches of pattern in string as a list of strings, scanned left to right in the order found. If one or more groups are present in the pattern, returns a list of groups. `findall(pattern, string, regex_ignorecase, regex_dotall, regex_multiline)` Learn more → |
| genomic_sequence | string | Retrieves a specific sequence from the genome. `genomic_sequence(genomic_region)` Learn more → |
| get | auto | Get the value at any depth of a nested object based on the path described by path; if the path doesn't exist, default is returned. `get(obj, path, default)` Learn more → |
| melt | object (list) | Convert a wide dataset to a long dataset by "melting" one or more fields into "key" and "value" fields. All fields must have the same data type. `melt(record, fields, key_field, value_field, melt_list_values)` Learn more → |
| normalize_aa_change | string | Normalize an amino acid change (beta). `normalize_aa_change(aa_change, ref, alt)` Learn more → |
| normalize_variant | string | Normalize a variant ID (minimal representation and left shifting). `normalize_variant(variant)` Learn more → |
| now | string | Retrieves the current date and time. `now(timezone, template)` Learn more → |
| predict_variant_effects | object (list) | Predict the effects of a variant using Veppy. `predict_variant_effects(variant, default_transcript, gene_model)` Learn more → |
| prevalence | double | Calculates the frequency that a value occurs within a population, typically the prevalence of variants or genes across samples in a dataset. Note: in large datasets the result is approximate and can have an error of up to 5%. `prevalence(dataset, entity, sample_field, filters)` Learn more → |
| search | boolean | Scan through string for the first location where the regular expression pattern produces a match. Returns True on a match and False otherwise. `search(pattern, string, regex_ignorecase, regex_dotall, regex_multiline)` Learn more → |
| search_groups | string (list) | Scan through string for the first location where the regular expression pattern produces a match. Returns a list of strings corresponding to the groups in the pattern. `search_groups(pattern, string, regex_ignorecase, regex_dotall, regex_multiline)` Learn more → |
| split | string (list) | Split text on a delimiter and optionally strip whitespace. `split(value, delimiter, regex, strip, regex_ignorecase, regex_dotall, regex_multiline)` Learn more → |
| sub | string | Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with repl; if the pattern isn't found, string is returned unchanged. `sub(pattern, repl, string, count, regex_ignorecase, regex_dotall, regex_multiline)` Learn more → |
| tabulate | object (list) | Converts a list of objects into a table (i.e. a two-dimensional array). `tabulate(objects, fields, header)` Learn more → |
| today | string | Returns the current date. `today(timezone, template)` Learn more → |
| translate_variant | object | Translate a variant into a protein change. `translate_variant(variant, gene_model, transcript, include_effects)` Learn more → |
| user | object | Returns the currently authenticated user. `user()` Learn more → |
annotate¶
Annotate a record with a template.
Output data type: object
Syntax
annotate(record, template, debug, include_errors)
- record: (object) The record to be annotated
- template: (str) The ID of the template
- debug: (bool) Enable debug mode (default: False)
- include_errors: (bool) Include errors in output (default: True)
beacon¶
Retrieves the beacon results for any entity.
Output data type: object
Output object properties:
- failed_count: The number of datasets that failed (timed-out)
- failed: List of datasets that failed (timed-out)
- not_found_count: The number of datasets without results
- found_count: The number of datasets with results
- found: List of datasets with results
- not_found: List of datasets without results
Syntax
beacon(entity, entity_type, beacon_set, datasets, visibility)
- entity: The entity value
- entity_type: A valid entity type
- beacon_set (optional): A valid beacon set ID
- datasets (optional): A list of datasets to beacon
- visibility (optional): Which datasets to beacon (default: vault)
classify_variant¶
Classify a variant using one of multiple classifiers.
Output data type: object
Syntax
classify_variant(variant, classifier)
- variant: The variant
- classifier: The desired classifier (default: "germline")
coerce_list¶
Coerce a value to a list. Single items will become a single value list.
Lists will remain lists. None will return an empty list.
Output data type: auto (list)
Syntax
coerce_list(value)
- value: The value to coerce to a list
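The coercion rules above can be sketched in plain Python (a hypothetical re-implementation of the documented behavior, not SolveBio's source):

```python
def coerce_list(value):
    """Coerce a value to a list: None -> [], list -> unchanged, item -> [item]."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]
```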
concat¶
Combine text from multiple lists or strings.
Output data type: string
Syntax
concat(values, delimiter)
- values: The list of values to concatenate
- delimiter (default: ""): The character to use in between values
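A minimal sketch of the documented behavior. Flattening one level of nested lists and skipping None values are assumptions about edge cases, not documented guarantees:

```python
def concat(values, delimiter=""):
    """Join values (strings or lists of strings) into one string."""
    flat = []
    for v in values:
        if isinstance(v, (list, tuple)):
            flat.extend(v)          # assumption: one level of list flattening
        elif v is not None:
            flat.append(v)          # assumption: None values are skipped
    return delimiter.join(str(v) for v in flat)
```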
crossmap¶
Convert a variant or genomic region entity between different genome builds
using the Ensembl CrossMap tool. The functionality of this expression is the same as UCSC's liftOver tool.
Output data type: string
Syntax
crossmap(entity, target_build)
- entity: The entity (either a valid SolveBio variant BUILD-CHROMOSOME-START-STOP-ALT or genomic region BUILD-CHROMOSOME-START-STOP)
- target_build: The target genome build (GRCH37 or GRCH38)
Examples
crossmap("GRCH38-13-32338647-32338647-T", "GRCH37")
dataset_count¶
Calculate the total number of results (or "hits") for a given query.
Returns the number of results.
Output data type: integer
Syntax
dataset_count(dataset, entities, filters, query)
- dataset: Any dataset with query permissions
- entities (optional): A list of entity tuples: [(entity_type, entity)]
- filters (optional): A valid filter block
- query (optional): A query string
dataset_entity_top_terms¶
Retrieve the top entities for any entity field in a dataset.
Returns a list of strings in order of occurrence, or None if the dataset cannot be queried by this entity.
Output data type: string (list)
Syntax
dataset_entity_top_terms(dataset, entity, limit, filters, query)
- dataset: Any dataset with query permissions
- entity: The entity_type to return within the dataset
- limit (optional): The number of terms to retrieve (default: 1000)
- filters (optional): Dataset filters
- query (optional): A query string
Examples
dataset_entity_top_terms("solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCH38", "gene")
dataset_field_percentiles¶
Calculates the percentiles for any integer field.
Returns an object containing the desired percentiles.
Output data type: object
Syntax
dataset_field_percentiles(dataset, field, percents, entities, filters, query)
- dataset: Any dataset with query permissions
- field: The field within the dataset
- percents: The percentiles to calculate (default: 1, 5, 25, 50, 75, 95, 99)
- entities (optional): A list of entity tuples: [(entity_type, entity)]
- filters (optional): Dataset filters
- query (optional): A query string
dataset_field_stats¶
Calculates statistics for any numeric field.
Returns an object containing field statistics.
Output data type: object
Output object properties:
- count: The total number of values
- max: The maximum value observed
- sum: The sum of all values
- avg: The average value
- min: The minimum value observed
Syntax
dataset_field_stats(dataset, field, entities, filters, query)
- dataset: Any dataset with query permissions
- field: The field within the dataset
- entities (optional): A list of entity tuples: [(entity_type, entity)]
- filters (optional): Dataset filters
- query (optional): A query string
dataset_field_terms_count¶
Retrieve the number of unique terms for any string field in a dataset.
Returns the number of unique terms.
Output data type: integer
Syntax
dataset_field_terms_count(dataset, field, entities, filters, query)
- dataset: Any dataset with query permissions
- field: The field within the dataset
- entities (optional): A list of entity tuples: [(entity_type, entity)]
- filters (optional): Dataset filters
- query (optional): A query string
Examples
dataset_field_terms_count("solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCh38", "clinical_significance")
dataset_field_top_terms¶
Retrieve the top terms for any string field in a dataset.
Returns a list of objects containing the term and number of times it occurs, in order of occurrence.
Output data type: object (list)
Output object properties:
- count: Number of times it occurs
- term: Term value
Syntax
dataset_field_top_terms(dataset, field, limit, entities, filters, query)
- dataset: Any dataset with query permissions
- field: The field within the dataset
- limit (optional): The number of terms to retrieve (default: 10)
- entities (optional): A list of entity tuples: [(entity_type, entity)]
- filters (optional): Dataset filters
- query (optional): A query string
Examples
dataset_field_top_terms("solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCh38", "clinical_significance")
dataset_field_values¶
Retrieves a list of non-empty values for a dataset field.
Returns a list of values from the specified field.
Output data type: auto (list)
Syntax
dataset_field_values(dataset, field, limit, entities, filters, query)
- dataset: Any dataset with query permissions
- field: The field within the dataset
- limit (optional): The number of values to return (default: 10)
- entities (optional): A list of entity tuples: [(entity_type, entity)]
- filters (optional): Dataset filters
- query (optional): A query string
dataset_query¶
Query any dataset with optional filters and/or entities.
Returns a list of results.
Output data type: object (list)
Syntax
dataset_query(dataset, fields, limit, entities, filters, query)
- dataset: Any dataset with query permissions
- fields (optional): Fields to retrieve (default: all)
- limit (optional): The number of values to return (default: 1)
- entities (optional): A list of entity tuples: [(entity_type, entity)]
- filters (optional): Dataset filters
- query (optional): A query string
Examples
dataset_query("solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCh38", fields=["clinical_significance"], query="*cancer*")
dataset_query("solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCh38", entities=[["variant", "GRCH38-13-32357842-32357842-TA"]])
datetime_format¶
Format datetime strings. By default, it returns an ISO 8601 format date time string.
To override, provide an optional input_format or output_format to be used.
Output data type: string
Syntax
datetime_format(value, input_format, output_format)
- value: (str) A string containing a date/time stamp
- input_format: (str) The input format of the date (e.g. "%d/%m/%y %H:%M")
- output_format: (str) The output format of the date (ISO 8601 format is the default: "%Y-%m-%dT%H:%M:%S")
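The format handling can be sketched with Python's standard `datetime` module (a hypothetical re-implementation; the fallback to ISO 8601 parsing when no input_format is given is an assumption):

```python
from datetime import datetime

ISO_8601 = "%Y-%m-%dT%H:%M:%S"

def datetime_format(value, input_format=None, output_format=None):
    """Parse a date/time string and re-format it (ISO 8601 output by default)."""
    if input_format:
        parsed = datetime.strptime(value, input_format)
    else:
        # Assumption: without an input_format, the input is ISO 8601-like.
        parsed = datetime.fromisoformat(value)
    return parsed.strftime(output_format or ISO_8601)
```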
entity_ids¶
Retrieve one or more normalized entity IDs for a query.
Output data type: string
Syntax
entity_ids(entity_type, entity)
- entity_type: The entity type to retrieve
- entity: The entity or query string
error¶
Raise a FunctionError.
Output data type: error
Syntax
error(message)
- message: An error message to raise
explode¶
Split N values from M list fields into N records.
If _id is in the original record, each new record will have an integer appended to the _id with the index of each exploded record.
Output data type: object (list)
Syntax
explode(record, fields)
- record: (object) The record to be split
- fields: (list or tuple) The field IDs
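The record-splitting logic can be sketched as follows (a hypothetical re-implementation; the exact _id suffix format and the requirement that all listed fields hold lists of equal length are assumptions):

```python
def explode(record, fields):
    """Split N values from list fields into N records, one per index."""
    n = len(record[fields[0]])  # assumption: all listed fields have length N
    out = []
    for i in range(n):
        new = dict(record)
        for f in fields:
            new[f] = record[f][i]
        if "_id" in record:
            # Assumption about the suffix format: "<original>.<index>"
            new["_id"] = "{}.{}".format(record["_id"], i)
        out.append(new)
    return out
```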
findall¶
Returns all non-overlapping matches of pattern in string,
as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, returns a list of groups.
Output data type: string (list)
Syntax
findall(pattern, string, regex_ignorecase, regex_dotall, regex_multiline)
- pattern: The regular expression pattern
- string: The string to search
- regex_ignorecase (default: None): With a "regex" pattern, performs case-insensitive matching.
- regex_dotall (default: None): With a "regex" pattern, makes the "." special character match any character at all, including a newline; without this flag, "." matches anything except a newline.
- regex_multiline (default: None): With a "regex" pattern, when specified, the pattern character "^" matches at the beginning of the string and at the beginning of each line (immediately following each newline), and the pattern character "$" matches at the end of the string and at the end of each line (immediately preceding each newline). By default, "^" matches only at the beginning of the string, and "$" only at the end of the string and immediately before the newline (if any) at the end of the string.
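This behaves like Python's `re.findall` with optional flags; a minimal sketch (hypothetical re-implementation, not the SolveBio source):

```python
import re

def findall(pattern, string, regex_ignorecase=None, regex_dotall=None,
            regex_multiline=None):
    """Return all non-overlapping matches of pattern in string."""
    flags = ((re.IGNORECASE if regex_ignorecase else 0)
             | (re.DOTALL if regex_dotall else 0)
             | (re.MULTILINE if regex_multiline else 0))
    return re.findall(pattern, string, flags)
```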
genomic_sequence¶
Retrieves a specific sequence from the genome.
Output data type: string
Syntax
genomic_sequence(genomic_region)
- genomic_region: A valid genomic region in the form: BUILD-CHROMOSOME-START-STOP
Examples
genomic_sequence("GRCh37-5-36241400-36241700")
get¶
Get the value at any depth of a nested object based on the path described by path. If the path doesn't exist, default is returned.
Output data type: auto
Syntax
get(obj, path, default)
- obj: (list|dict) The object to process
- path: (str|list) A list, or a "."-delimited string, describing the path
- default (keyword): Default value to return if the path doesn't exist (defaults to None)
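A minimal pure-Python sketch of this nested lookup (hypothetical; treating numeric path segments as list indices is an assumption):

```python
def get(obj, path, default=None):
    """Walk a nested dict/list by a '.'-delimited string or list path."""
    keys = path.split(".") if isinstance(path, str) else list(path)
    current = obj
    for key in keys:
        try:
            if isinstance(current, (list, tuple)):
                current = current[int(key)]  # assumption: numeric index into lists
            else:
                current = current[key]
        except (KeyError, IndexError, TypeError, ValueError):
            return default
    return current
```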
melt¶
Convert a wide dataset to a long dataset by "melting" one or more fields
into "key" and "value" fields. All fields must have the same data type.
Output data type: object (list)
Syntax
melt(record, fields, key_field, value_field, melt_list_values)
- record: (object) The record to be melted
- fields: (list or tuple) The field IDs
- key_field: (str) key field (default: "key")
- value_field: (str) value field (default: "value")
- melt_list_values: (bool) (default: False)
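The wide-to-long transformation can be sketched as follows (a hypothetical re-implementation; carrying non-melted fields into every output record is an assumption):

```python
def melt(record, fields, key_field="key", value_field="value",
         melt_list_values=False):
    """Melt the chosen fields into key/value records."""
    # Assumption: fields not being melted are carried over to each record.
    base = {k: v for k, v in record.items() if k not in fields}
    out = []
    for f in fields:
        values = record.get(f)
        if melt_list_values and isinstance(values, list):
            # One record per list element when melt_list_values is set.
            for v in values:
                out.append(dict(base, **{key_field: f, value_field: v}))
        else:
            out.append(dict(base, **{key_field: f, value_field: values}))
    return out
```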
normalize_aa_change¶
Normalize an amino acid change (beta).
Output data type: string
Syntax
normalize_aa_change(aa_change, ref, alt)
- aa_change: The aa_change
- ref: (optional) Reference allele
- alt: (optional) Alternate allele
normalize_variant¶
Normalize a variant ID (minimal representation and left shifting).
Output data type: string
Syntax
normalize_variant(variant)
- variant: The variant
now¶
Retrieves the current date and time.
Output data type: string
Syntax
now(timezone, template)
- timezone (default: EST): The timezone to use for the date
- template (default: ISO 8601): The format in which to represent the date/time, defaults to ISO 8601 format (%Y-%m-%dT%H:%M:%S)
predict_variant_effects¶
Predict the effects of a variant using Veppy.
Output data type: object (list)
Output object properties:
- so_term: The Sequence Ontology term
- impact: The effect impact
- so_accession: The Sequence Ontology accession number
- transcript: The affected transcript ID
- lof: True if the mutation is predicted to cause the protein to lose its function
Syntax
predict_variant_effects(variant, default_transcript, gene_model)
- variant: The variant
- default_transcript (optional): If True, return effects for just the default transcript. If a specific transcript, then limits results to this transcript only. Otherwise returns effects for all transcripts.
- gene_model (optional): The desired gene model: refseq (default) or ensembl
Examples
predict_variant_effects("GRCH38-7-117559590-117559593-A")
prevalence¶
Calculates the frequency that a value occurs within a population.
Typically used to calculate the prevalence of variants or genes across samples in a dataset. Returns the frequency of occurrence.
Please note: in large datasets the result is approximate and can have an error of up to 5%.
Output data type: double
Syntax
prevalence(dataset, entity, sample_field, filters)
- dataset: Any dataset with discover permissions
- entity: A single entity tuple: (entity_type, entity)
- sample_field: The field containing the sample IDs
- filters (optional): Filters to apply on the dataset
search¶
Scan through string looking for the first location where
the regular expression pattern produces a match. Returns True on a match and False if no position in the string matches the pattern.
Output data type: boolean
Syntax
search(pattern, string, regex_ignorecase, regex_dotall, regex_multiline)
- pattern: The regular expression pattern
- string: The string to search
- regex_ignorecase (default: None): With a "regex" pattern, performs case-insensitive matching.
- regex_dotall (default: None): With a "regex" pattern, makes the "." special character match any character at all, including a newline; without this flag, "." matches anything except a newline.
- regex_multiline (default: None): With a "regex" pattern, when specified, the pattern character "^" matches at the beginning of the string and at the beginning of each line (immediately following each newline), and the pattern character "$" matches at the end of the string and at the end of each line (immediately preceding each newline). By default, "^" matches only at the beginning of the string, and "$" only at the end of the string and immediately before the newline (if any) at the end of the string.
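This is a boolean wrapper around a regex search; a minimal sketch (hypothetical re-implementation, not the SolveBio source):

```python
import re

def search(pattern, string, regex_ignorecase=None, regex_dotall=None,
           regex_multiline=None):
    """Return True if pattern matches anywhere in string, else False."""
    flags = ((re.IGNORECASE if regex_ignorecase else 0)
             | (re.DOTALL if regex_dotall else 0)
             | (re.MULTILINE if regex_multiline else 0))
    return re.search(pattern, string, flags) is not None
```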
search_groups¶
Scan through string looking for the first location where
the regular expression pattern produces a match. Returns a list of strings corresponding to the groups in the pattern.
Output data type: string (list)
Syntax
search_groups(pattern, string, regex_ignorecase, regex_dotall, regex_multiline)
- pattern: The regular expression pattern
- string: The string to search
- regex_ignorecase (default: None): With a "regex" pattern, performs case-insensitive matching.
- regex_dotall (default: None): With a "regex" pattern, makes the "." special character match any character at all, including a newline; without this flag, "." matches anything except a newline.
- regex_multiline (default: None): With a "regex" pattern, when specified, the pattern character "^" matches at the beginning of the string and at the beginning of each line (immediately following each newline), and the pattern character "$" matches at the end of the string and at the end of each line (immediately preceding each newline). By default, "^" matches only at the beginning of the string, and "$" only at the end of the string and immediately before the newline (if any) at the end of the string.
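A minimal sketch of the group-extraction behavior (hypothetical re-implementation; returning None when nothing matches is an assumption, as the documentation does not specify the no-match result):

```python
import re

def search_groups(pattern, string, regex_ignorecase=None, regex_dotall=None,
                  regex_multiline=None):
    """Return the groups of the first match of pattern in string."""
    flags = ((re.IGNORECASE if regex_ignorecase else 0)
             | (re.DOTALL if regex_dotall else 0)
             | (re.MULTILINE if regex_multiline else 0))
    match = re.search(pattern, string, flags)
    # Assumption: no match yields None rather than an empty list.
    return list(match.groups()) if match else None
```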
split¶
Split text based on a delimiter and optionally strip whitespace.
Output data type: string (list)
Syntax
split(value, delimiter, regex, strip, regex_ignorecase, regex_dotall, regex_multiline)
- value: The string to split
- delimiter (default: any whitespace): The character(s) to split on
- regex (default: None): A valid Python regular expression pattern to split on.
- strip (default: True): Strip whitespace from each resulting value
- regex_ignorecase (default: None): With a "regex" pattern, performs case-insensitive matching.
- regex_dotall (default: None): With a "regex" pattern, makes the "." special character match any character at all, including a newline; without this flag, "." matches anything except a newline.
- regex_multiline (default: None): With a "regex" pattern, when specified, the pattern character "^" matches at the beginning of the string and at the beginning of each line (immediately following each newline), and the pattern character "$" matches at the end of the string and at the end of each line (immediately preceding each newline). By default, "^" matches only at the beginning of the string, and "$" only at the end of the string and immediately before the newline (if any) at the end of the string.
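The delimiter/regex/strip interplay can be sketched like this (a hypothetical re-implementation of the documented parameters):

```python
import re

def split(value, delimiter=None, regex=None, strip=True,
          regex_ignorecase=None, regex_dotall=None, regex_multiline=None):
    """Split value on a regex, a literal delimiter, or (default) any whitespace."""
    flags = ((re.IGNORECASE if regex_ignorecase else 0)
             | (re.DOTALL if regex_dotall else 0)
             | (re.MULTILINE if regex_multiline else 0))
    if regex:
        parts = re.split(regex, value, flags=flags)
    elif delimiter is None:
        parts = value.split()  # documented default: split on any whitespace
    else:
        parts = value.split(delimiter)
    return [p.strip() for p in parts] if strip else parts
```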
sub¶
Return the string obtained by replacing the leftmost
non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged.
Output data type: string
Syntax
sub(pattern, repl, string, count, regex_ignorecase, regex_dotall, regex_multiline)
- pattern: The regular expression pattern
- repl: The string to replace matches with
- string: The string to search
- count (default: 0): The maximum number of pattern occurrences to be replaced. If zero, all occurrences are replaced.
- regex_ignorecase (default: None): With a "regex" pattern, performs case-insensitive matching.
- regex_dotall (default: None): With a "regex" pattern, makes the "." special character match any character at all, including a newline; without this flag, "." matches anything except a newline.
- regex_multiline (default: None): With a "regex" pattern, when specified, the pattern character "^" matches at the beginning of the string and at the beginning of each line (immediately following each newline), and the pattern character "$" matches at the end of the string and at the end of each line (immediately preceding each newline). By default, "^" matches only at the beginning of the string, and "$" only at the end of the string and immediately before the newline (if any) at the end of the string.
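This mirrors Python's `re.sub`, including the count-of-zero convention; a minimal sketch (hypothetical re-implementation, not the SolveBio source):

```python
import re

def sub(pattern, repl, string, count=0, regex_ignorecase=None,
        regex_dotall=None, regex_multiline=None):
    """Replace leftmost non-overlapping matches of pattern with repl."""
    flags = ((re.IGNORECASE if regex_ignorecase else 0)
             | (re.DOTALL if regex_dotall else 0)
             | (re.MULTILINE if regex_multiline else 0))
    # count=0 means "replace all occurrences", as in re.sub.
    return re.sub(pattern, repl, string, count=count, flags=flags)
```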
tabulate¶
Converts a list of objects into a table (i.e. a two-dimensional array).
Output data type: object (list)
Syntax
tabulate(objects, fields, header)
- objects: The list of objects
- fields (optional): List of fields to include (default: all)
- header (optional): Include a header row (default: True)
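The table conversion can be sketched as follows (a hypothetical re-implementation; taking the default field order from the first object is an assumption):

```python
def tabulate(objects, fields=None, header=True):
    """Convert a list of dicts into a 2-D array, optionally with a header row."""
    if fields is None:
        # Assumption: default fields come from the first object's keys.
        fields = list(objects[0]) if objects else []
    rows = [[obj.get(f) for f in fields] for obj in objects]
    return ([list(fields)] + rows) if header else rows
```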
today¶
Returns the current date.
Output data type: string
Syntax
today(timezone, template)
- timezone (default: EST): The timezone to use for the date
- template (default: YYYY-MM-DD): The format in which to represent the date
translate_variant¶
Translate a variant into a protein change.
Output data type: object
Output object properties:
- protein_length: Number of amino acids in the protein
- cdna_change: cDNA change
- protein_change: Protein change
- protein_coordinates: A dictionary containing start and stop coordinates and the affected transcript ID
- gene: HUGO gene symbol
- transcript: The transcript ID
- effects: list of effects
Syntax
translate_variant(variant, gene_model, transcript, include_effects)
- variant: The variant
- gene_model (optional): The desired gene model: refseq (default) or ensembl
- transcript (optional): Limits results to this transcript only
- include_effects (optional): Returns the effects of the variant using Veppy
Examples
translate_variant("GRCH38-7-117559590-117559593-A")
translate_variant("GRCH38-7-117559590-117559593-A", gene_model="ensembl")
translate_variant("GRCH38-7-117559590-117559593-A", transcript="NM_000492.3")
translate_variant("GRCH38-7-117559590-117559593-A", include_effects=True)
user¶
Returns the currently authenticated user.
Output data type: object
Output object properties:
- name: The user's full name.
- email: The user's email address.