Dataset Imports¶
create¶
HTTP Request
POST https://api.solvebio.com/v2/dataset_imports
Parameters
This request does not accept URL parameters.
Authorization
This request requires an authorized user with write permission on the dataset.
Request Body
In the request body, provide an object with the following properties:
Property | Value | Description |
---|---|---|
commit_mode | string | A valid commit mode. |
dataset_id | integer | The target dataset to import into. |
object_id | integer | (optional) The ID of an existing object on SolveBio. |
manifest | object | (optional) A file manifest (see below). |
data_records | objects | (optional) A list of records to import synchronously. |
description | string | (optional) A description of this import. |
entity_params | object | (optional) Configuration parameters for entity detection. |
reader_params | object | (optional) Configuration parameters for readers. |
validation_params | object | (optional) Configuration parameters for validation. |
annotator_params | object | (optional) Configuration parameters for the Annotator. |
include_errors | boolean | If true, a new field (_errors) will be added to each record containing expression evaluation errors (default: true). |
target_fields | objects | A list of valid dataset fields to create or override in the import. |
priority | integer | A priority to assign to this task. |
When creating a new import, one of `manifest`, `object_id`, or `data_records` must be provided. Using a manifest allows you to import a remote file accessible via HTTP(S), for example:

```json
{
    "files": [{
        "url": "https://example.com/file.json.gz",
        "name": "file.json.gz",
        "format": "json",
        "size": 100,
        "md5": "",
        "base64_md5": ""
    }]
}
```
Response
The response returns "HTTP 201 Created", along with the DatasetImport resource when successful.
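The `md5` and `base64_md5` manifest fields are optional checksums over the file's bytes. A minimal sketch of building a manifest entry with Python's standard library (the URL, filename, and file contents below are placeholders, not a real SolveBio object):

```python
import base64
import hashlib

def build_manifest_entry(url, name, fmt, data):
    """Build one entry for the manifest's "files" list.

    md5 and base64_md5 are optional integrity checksums computed
    over the raw file bytes.
    """
    digest = hashlib.md5(data).digest()
    return {
        "url": url,
        "name": name,
        "format": fmt,
        "size": len(data),
        "md5": digest.hex(),
        "base64_md5": base64.b64encode(digest).decode("ascii"),
    }

# Placeholder file contents and URL, for illustration only
data = b'{"field": 1}\n'
manifest = {
    "files": [
        build_manifest_entry(
            "https://example.com/file.json", "file.json", "json", data
        )
    ]
}
```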
Reader Parameters¶
Reader | Reader name | Extension |
---|---|---|
VCF | vcf | .vcf |
JSONL | json | .json |
CSV | csv | .csv |
TSV | tsv | .tsv, .txt, .maf |
XML | xml | .xml |
GTF | gtf | .gtf |
GFF3 | gff3 | .gff3 |
Nirvana JSON | nirvana | .json |
SolveBio automatically selects a reader based on the imported file's extension. The exception is Nirvana JSON: because it shares the `.json` extension with JSONL files, the `reader` attribute must be set manually to `nirvana`.
In the case where the extension is not recognized, you can manually select a reader by setting the `reader` attribute of `reader_params` to the associated reader name:

```python
# Force the JSONL reader
reader_params = {
    'reader': 'json'
}

imp = DatasetImport.create(
    reader_params=reader_params,
    ...
)
```
JSON (JSONL)¶
The JSONL format supported by SolveBio has four requirements (adapted from jsonlines.org):
1. UTF-8 encoding
JSON allows encoding Unicode strings with only ASCII escape sequences; however, those escapes are hard to read when viewed in a text editor. The author of a JSON Lines file may choose to escape characters to work with plain ASCII files.
Non-ASCII content may be corrupted during the import process if non-UTF-8 files are imported.
2. Each line must be a complete JSON object
Specifically, each line must be a JSON object without any internal line breaks. For example, here are three records:

```json
{"field": 1}
{"field": 2}
{"field": 3}
```
3. Lines are separated by `'\n'`
This means `'\r\n'` is also supported, because trailing whitespace is ignored when parsing JSON values. The last character in the file may be a line separator, and it will be treated the same as if there were no line separator present.
4. The file extension must be `.json` or `.json.gz`
JSON Lines files for SolveBio must be saved with the `.json` extension. Files may be gzipped, resulting in the `.json.gz` extension.
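The four requirements above can be satisfied with Python's standard library alone; a minimal sketch (the output filename is arbitrary):

```python
import gzip
import json

records = [{"field": 1}, {"field": 2}, {"field": 3}]

# Write gzipped JSON Lines: UTF-8 encoding, one complete JSON object
# per line, '\n' separators, and a .json.gz extension.
with gzip.open("records.json.gz", "wt", encoding="utf-8", newline="\n") as f:
    for record in records:
        # ensure_ascii=False keeps non-ASCII characters as UTF-8
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read it back: each line parses independently as one record
with gzip.open("records.json.gz", "rt", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
```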
CSV/TSV¶
The following parameters can be passed for files that end with the extensions `.csv`, `.tsv`, and `.txt`:
- `delimiter`: A one-character string used to separate fields. Defaults to `','` for CSVs and `'\t'` for TSVs and TXT files.
- `quotechar`: A one-character string used to quote fields containing special characters (such as the delimiter or quotechar) or new-line characters. Defaults to `'"'`.
- `header`: The row number (starting from 0) containing the header, `None` (to indicate no header), or `infer` (default). By default, column names are inferred from the first row unless `headercols` is provided, in which case the file is assumed to have no header row. Set `header=0` and provide `headercols` to replace existing headers.
- `headercols`: A list of field names that represent column headers. By default, providing `headercols` assumes the file has no header; to replace existing headers, set `header=0`. The order of the columns matters and must match the number of delimited columns in each line.
- `comment`: A string used to determine which lines in the file are comments. Lines that begin with this string will be ignored. Default is `'#'`.
- `skiprows`: A list of integers defining the line numbers of the file that should be skipped. The first line is line 0. Default is `[]`.
- `skipcols`: A list of integers defining the columns of the file that should be ignored. The first column is column 0. Default is `[]`.
Column Ordering¶
This reader will preserve the column order of the original file, unless otherwise overridden with an import template.
Numeric Fields¶
This reader will cast all numeric fields to doubles.
The following example modifies the default CSV reader to handle a pipe-delimited file with 5 header rows:
```python
# Custom reader params for a pipe-delimited "CSV" file with 5 header rows:
csv_reader_settings = {
    'delimiter': '|',
    'skiprows': [0, 1, 2, 3, 4]
}

imp = DatasetImport.create(
    reader_params=csv_reader_settings,
    ...
)
```
VCF¶
The following parameters can be passed for files that end with the extension `.vcf`:
- `genome_build`: The string `'GRCh37'` or `'GRCh38'`. If no `genome_build` is passed, an attempt is made to guess the build from the file headers, falling back to `GRCh37` if nothing is found.
- `explode_annotations`: Default `False`. Explodes the annotations column of the VCF, creating one new record per annotation. By default it looks for annotations at the `ANN` column within the `info` object (`info.ANN`). This key can be configured with the `annotations_key` parameter.
- `annotations_key`: The field name that contains the VCF annotations, for use with the `explode_annotations` parameter. The default key is `ANN`.
- `sample_key`: The field name that the VCF parser will output the VCF samples to. The default key is `sample`.
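A sketch combining these parameters for a GRCh38 VCF whose annotations live under `info.CSQ` rather than the default `info.ANN` (`CSQ` is a common key in VEP-annotated files, used here as an assumed example; `dataset.id` and `object.id` are placeholders):

```python
# Reader parameters for a GRCh38 VCF with annotations under info.CSQ
vcf_reader_params = {
    'genome_build': 'GRCh38',
    'explode_annotations': True,   # one record per annotation
    'annotations_key': 'CSQ',      # override the default 'ANN'
}

# Shown for context only; dataset_id and object_id stand in for real IDs.
# imp = DatasetImport.create(
#     dataset_id=dataset.id,
#     object_id=object.id,
#     reader_params=vcf_reader_params,
# )
```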
XML¶
The following parameters can be passed for files that end with the extension `.xml`:
- `item_depth`: An integer defining at which XML element depth to begin record enumeration. Default is 1. A depth of 0 is the XML document root element and returns a single record.
- `required_keys`: A list of strings representing keys that must exist in the XML element; otherwise the record will be ignored.
- `cdata_key`: A string that identifies the text value of a node element. Default is `text`.
- `attr_prefix`: A string representing the default prefix for node attributes. Default is `@`.
Example XML Document:

```xml
<xml>
  <library>
    <shelf>
      <book lang="eng">
        <title>SolveBio Docs, 3rd Edition</title>
        <publish_date></publish_date>
        <summary></summary>
      </book>
      <book lang="eng">
        <title></title>
        <publish_date></publish_date>
        <summary></summary>
      </book>
    </shelf>
    <shelf>
      <book>
        <title></title>
      </book>
    </shelf>
  </library>
</xml>
```
- `item_depth=0` would parse a single library record with nested shelves and books.
- `item_depth=1` would parse two shelf records with nested books.
- `item_depth=2` would parse three book records.
- `required_keys=[]` would return 3 book records.
- `required_keys=['title']` would also return 3 book records.
- `required_keys=['summary']` would return only 2 book records.
- `cdata_key="value"` would return the field name `book.title.value`.
- `attr_prefix=""` would return the field name `book.lang`.
- `attr_prefix="_"` would return the field name `book._lang`.
Example XML Document:

```xml
<xml>
  <library>
    <shelf>
      <book></book>
    </shelf>
    <shelf>
      <dust></dust>
    </shelf>
  </library>
</xml>
```
- `item_depth=1` and `required_keys=['dust']` would parse 1 shelf record.
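The `item_depth` and `required_keys` behavior described above can be mimicked with the standard library. This sketch is not SolveBio's XML reader, only an illustration of how depth-based enumeration and key filtering select records (the sample document is a simplified stand-in):

```python
import xml.etree.ElementTree as ET

DOC = """
<xml>
  <library>
    <shelf>
      <book lang="eng"><title>A</title><summary>s</summary></book>
      <book lang="eng"><title>B</title></book>
    </shelf>
    <shelf>
      <book><title>C</title></book>
    </shelf>
  </library>
</xml>
"""

def records_at_depth(xml_text, depth, required_keys=()):
    """Return the elements that would become records at item_depth=depth.

    Depth 0 is the document's content root (here <library>); each
    extra level descends one generation. Elements missing any of
    required_keys as a direct child are skipped.
    """
    root = ET.fromstring(xml_text)
    level = list(root)  # depth 0: children of the outer wrapper
    for _ in range(depth):
        level = [child for element in level for child in element]
    return [element for element in level
            if all(element.find(key) is not None for key in required_keys)]
```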
GFF3¶
The following parameter can be passed for files that end with the extension `.gff3`:

- `comment`: A string used to determine which lines in the file are comments. Lines that begin with this string will be ignored. Default is `'##'`.
Nirvana JSON¶
Nirvana JSON files must follow Illumina's official Nirvana JSON layout in order to be parsed properly.
Entity Detection Parameters¶
When importing data, every field is sampled to determine whether it is a SolveBio entity. The following configuration parameters allow this detection to be customized by setting `entity_params` on the import object.
Genes and variants are detected by default. The example below overrides this and attempts to detect only genes and literature entities:
```python
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=object.id,
    entity_params={
        'entity_types': ['gene', 'literature']
    }
)
```
To completely disable entity detection, use the `disable` attribute:

```python
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=object.id,
    entity_params={
        'disable': True
    }
)
```
Validation Parameters¶
The following settings can be passed in the `validation_params` field:

- `disable` (boolean, default `False`): Disables validation completely.
- `raise_on_errors` (boolean, default `False`): Fails the import on the first validation error encountered.
- `strict_validation` (boolean, default `False`): Upgrades all validation warnings to errors.
- `allow_new_fields` (boolean, default `False`): If strict validation is enabled, still allows new fields to be added.
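A sketch of a common combination: strict validation that stops at the first error but still tolerates new columns (`dataset.id` and `object.id` are placeholders for real IDs):

```python
# Strict validation, but new fields are still allowed
validation_params = {
    'strict_validation': True,   # upgrade warnings to errors...
    'allow_new_fields': True,    # ...but permit new columns
    'raise_on_errors': True,     # fail fast on the first error
}

# Shown for context only; not executed here.
# imp = DatasetImport.create(
#     dataset_id=dataset.id,
#     object_id=object.id,
#     validation_params=validation_params,
# )
```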
Validation raises the errors and warnings below, listed in the format: [Error code] Name: Description.
Warnings¶
- [202] Column Name Warning: Column name uses characters that do not comply with strict column name validation (upgraded to an error if `strict_validation=True`).
- [203] New Column Added: A new column was added to the dataset (upgraded to an error if `strict_validation=True` and `allow_new_fields=False`).
- [302] List Expected Violation: A column expected a list of values but did not receive one. For example, a field has `is_list=True` but received a single string (upgraded to an error if `strict_validation=True`).
- [303] Unexpected List Violation: A column expected a single value but received a list of values. For example, a field has `is_list=False` but received a list of strings (upgraded to an error if `strict_validation=True`).
- [400] Too Many Columns in Record: Warns if 150 or more columns are found; errors if 400 or more.
Errors¶
- [301] Invalid Value for Field: Value is not a valid type (e.g. an integer passed for a `date` field `data_type`).
- [304] NaN Value for Field: Value is a JSON "NaN" value, which cannot be indexed by SolveBio.
- [305] Infinity Value for Field: Value is a JSON "Infinity" value, which cannot be indexed by SolveBio.
- [306] Max String Length for Field: The maximum size for the `string` `data_type` is 32,766 bytes. Anything larger must use the `text` `data_type`.
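Errors [304]–[306] can be caught client-side before uploading. This is a hypothetical pre-flight check, not part of the SolveBio client; the 32,766-byte limit comes from the [306] description above:

```python
import math

MAX_STRING_BYTES = 32766  # limit for the string data_type ([306])

def record_problems(record):
    """Return (field, reason) pairs for values SolveBio cannot index."""
    problems = []
    for field, value in record.items():
        if isinstance(value, float) and not math.isfinite(value):
            # Covers NaN ([304]) and Infinity ([305])
            problems.append((field, "NaN/Infinity cannot be indexed"))
        elif isinstance(value, str) and \
                len(value.encode("utf-8")) > MAX_STRING_BYTES:
            problems.append((field, "exceeds string data_type limit"))
    return problems
```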
Annotator Parameters¶
The following setting can be used to customize the annotator used during transformation:

- `annotator` (string): Choose from `simple` (default), `serial`, or `parallel`.
delete¶
Not recommended
Deleting dataset imports is not recommended as data provenance will be lost.
HTTP Request
DELETE https://api.solvebio.com/v2/dataset_imports/{ID}
Parameters
This request does not accept URL parameters.
Authorization
This request requires an authorized user with write permission on the dataset.
Request Body
Do not supply a request body with this method.
Response
The response returns "HTTP 200 OK" when successful.
get¶
HTTP Request
GET https://api.solvebio.com/v2/dataset_imports/{ID}
Parameters
This request does not accept URL parameters.
Authorization
This request requires an authorized user with read permission on the dataset.
Request Body
Do not supply a request body with this method.
Response
The response contains a DatasetImport resource.
list¶
HTTP Request
GET https://api.solvebio.com/v2/datasets/{DATASET_ID}/imports
Parameters
This request accepts the following parameters:
Parameter | Value | Description |
---|---|---|
limit | integer | The number of objects to return per page. |
offset | integer | The offset within the list of available objects. |
Authorization
This request requires an authorized user with read permission on the dataset.
Response
The response contains a list of DatasetImport resources.
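The `limit`/`offset` parameters can be combined into a paging loop. A sketch under the assumption that `fetch` is any callable returning one page of results (client libraries typically handle this for you):

```python
def iter_imports(fetch, limit=100):
    """Yield every DatasetImport, requesting `limit` objects per page.

    `fetch(limit=..., offset=...)` stands in for an HTTP call to
    GET /v2/datasets/{DATASET_ID}/imports with those query parameters.
    """
    offset = 0
    while True:
        page = fetch(limit=limit, offset=offset)
        for item in page:
            yield item
        if len(page) < limit:
            break  # a short page means we reached the end of the list
        offset += limit
```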