Skip to content

Exporting Data

Overview

SolveBio is committed to data access and data portability. Exporting data to downstream tools is a key part of molecular data analytics and our goal is to make that a seamless process.

Dataset can be exported in multiple formats:

  • JSON: JSON Lines format (gzipped).
  • CSV: Comma Separated Value format (flattened, gzipped).
  • TSV: Tab Separated Value format (flattened, gzipped).
  • Excel (XLSX): Microsoft Excel format (flattened).

An exported JSON file can be re-imported into SolveBio without any modification.

Exporting data can take anywhere from a few seconds to tens of minutes, depending on the number of records and selected format. Exports are processed server-side, and the output is a downloadable file.

Export Limits

Different export formats have different limits.

Format Max Records
Excel 1,048,576 records
JSON 500,000,000 records
TSV 500,000,000 records
CSV 500,000,000 records

Flattened Fields (CSV/XLSX only)

CSV and XLSX exports are processed by a flattening algorithm during export. The reason for this is to handle list fields, which are not well supported by Excel and other CSV readers. The following example illustrates the effects of the flattening algorithm:

The following dataset records:

1
2
3
{"a": "a", "b": ["x"]}
{"a": "a", "b": ["x", "y"]}
{"a": "a", "b": ["x", "y", "z"]}

will be exported to the following CSV:

1
2
3
4
a,b.0,b.1,b.2
a,x,,
a,x,y,
a,x,y,z

Export a Dataset

To export a dataset, retrieve it by name or ID, and initiate the export. Exports can take a few minutes for large datasets. You can always start a large export and check back when it finishes on the Activity tab of the SolveBio website. Exports can also be saved directly into a vault (with target_full_path keyword argument) and accessed from there.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
from solvebio import Dataset

dataset = Dataset.get_by_full_path('solvebio:public:/HGNC/3.1.0-2017-06-29/HGNC')

# Export the entire dataset (~40k records), this may take a minute...
# NOTE: `format` can be: json, tsv, csv, or excel
#       `send_email_on_completion`: enable/disable sending an email when the export is ready
export = dataset.export(format='json', follow=True, send_email_on_completion=True)

# Save the exported file to the current directory
export.download('./')

# Exports can also be saved to a path in a vault
dataset.export(target_full_path='my_vault:/path/to/json_files_folder/my_export')
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
library(solvebio)

dataset <- Dataset.get_or_create_by_full_path('solvebio:public:/HGNC/3.1.0-2017-06-29/HGNC')

# Export the entire dataset (~40k records), this may take a minute...
# NOTE: `format` can be: json, tsv, csv, or excel
export <- DatasetExport.create(
    dataset$id,
    format = 'csv',
    params = NULL,
    send_email_on_completion = TRUE
)

# Wait for the export to complete
Dataset.activity(dataset$id)

# Download
url <- DatasetExport.get_download_url(export$id)
download.file(url, 'data.csv.gz')

# An exports can also be saved to a path in a vault
export <- DatasetExport.create(
    dataset$id,
    format = 'csv',
    params = NULL,
    send_email_on_completion = TRUE,
    target_full_path='my_vault:/path/to/csv_files_folder/my_export'
)

Exporting Large Amounts of Data

An example file size for a CSV file with 150M rows and 50 columns populated with floats and relatively short strings is about 50GB. In general we recommend not working with files this size.

Shrink the export by applying filters or selecting only specific columns. If necessary, export in batches (e.g. export by chromosome or sample).

Export a Filtered Dataset

In this example we will export a slice of a dataset. This leverages the dataset filtering system.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
from solvebio import Dataset

dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Filter the dataset by field values, limit the number of results, select a subset of fields
query = dataset.query(limit=100, fields=["variant_sbid", "review_status_star", "gene_symbol"]).filter(review_status_star__gte=3)

# Export the query (100 records, filtered on a field)
# NOTE: `format` can be: json, tsv, csv, or excel
#       `send_email_on_completion`: enable/disable sending an email when the export is ready
export = query.export(format='json', follow=True, send_email_on_completion=True)

# Save the exported file to a specific location (optionally with a specific name)
export.download(path='./my_variants.json.gz')
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
library(solvebio)

dataset <- Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Filter the dataset by field values and limit the number of results
# NOTE: `format` can be: json, tsv, csv, or excel
filters <- list(list("review_status_star__gte", 3))
fields <- list("variant_sbid", "review_status_star", "gene_symbol")
export <- DatasetExport.create(
    dataset$id,
    format = 'json',
    params=list(filters=filters, fields=fields, limit=100),
    follow = TRUE,
)

# Download to your home directory
url <- DatasetExport.get_download_url(export$id)
download.file(url, 'my_variants.json.gz')

Export in Batches

In this example we will export the data by chromosome.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
from solvebio import Dataset

dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Get available chromosomes
facets = dataset.query().facets(**{'genomic_coordinates.chromosome': {'facet_type': 'terms', 'limit': 100}})
print("Found {} chromosomes".format(len(facets['genomic_coordinates.chromosome'])))
for chromosome, records_count in facets['genomic_coordinates.chromosome']:

    # Defines a location on SolveBio to export to (this is a shortcut to the user's personal vault)
    # Appends chromosome to the filename
    target_full_path = "~/clinvar_{}.csv.gz".format(chromosome)

    # Filter the query by chromosome
    filtered_query = query.filter(**{'genomic_coordinates.chromosome': chromosome})

    # Export
    filtered_query.export(format='csv-gz', follow=False, send_email_on_completion=True, target_full_path=target_full_path)