Datasets Overview¶

What are Datasets?¶

A dataset on SolveBio represents a collection of records indexed with a specific schema (that is, a list of fields). Datasets make it easy to query and filter data of any size in real time.

Datasets can be created with a predefined schema (using a template) or without any fields. SolveBio always detects new fields in imported records, but we recommend using templates so that the data types of fields can be set in advance. Records can be added to datasets in several ways, such as transforming files into datasets, programmatically generating records, or copying records from other datasets using "migrations".

All SolveBio datasets are stored in vaults. Vaults are similar to filesystems in that they are hierarchical object stores. Unlike ordinary filesystems, however, vaults have special knowledge of SolveBio datasets and allow certain actions to be performed on them. Vaults can be private to you, or shared with specific people and groups within your organization. SolveBio also provides a public vault containing public datasets that SolveBio staff continually maintain and update, such as the ClinVar and GWAS Catalog datasets.

Dataset Features¶

Datasets are designed to be a flexible and scalable solution for storing structured JSON-compatible data. The molecular data landscape is filled with a large variety of unique file formats, each with its own subtleties and quirks. On SolveBio, almost any data source can be imported into a dataset as long as it can be transformed into JSON. SolveBio supports many formats, making it easy to import your data into a dataset. Once your data has been imported, you can take advantage of the many features that datasets offer:

Scalability¶

There is no limit to the amount of data that can be stored within a single dataset. When creating your dataset, you can specify its intended "capacity" depending on how large you think it will grow. The default capacity is "small", which supports datasets in the tens of millions of records. While the capacity cannot be changed on an existing dataset, you can always migrate your data to a new, larger-capacity dataset.

Learn more about creating datasets →

Flexible Schemas¶

Most database systems require careful planning of schemas prior to loading any data at all. While we still recommend putting thought into your schemas, the messy nature and size of biological datasets often make it difficult to know their structure and complexity in advance. SolveBio automatically detects new fields (including data types and entity types) and applies validation on all imported data. This means you can reliably throw any data into SolveBio and explore your data in minutes instead of hours. The data migration features (see below) make it easy to further clean up your data and refine your schemas as they evolve.
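To make the idea of field detection concrete, here is a minimal local sketch of how new fields and their data types could be inferred from JSON-compatible records. It is an illustration only, not SolveBio's detection logic: the function names, the type labels, and the sample records are all hypothetical.

```python
from typing import Any, Dict, Iterable


def detect_type(value: Any) -> str:
    """Map a JSON-compatible value to a coarse, hypothetical type label."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, (list, dict)):
        return "object"
    return "unknown"


def detect_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Record each field the first time it appears with a non-null value."""
    schema: Dict[str, str] = {}
    for record in records:
        for name, value in record.items():
            if value is not None and name not in schema:
                schema[name] = detect_type(value)
    return schema


# Hypothetical records; the second one introduces two new fields.
records = [
    {"gene": "BRCA1", "position": 43044295},
    {"gene": "TP53", "position": 7668402, "pathogenic": True, "score": 0.97},
]
schema = detect_fields(records)
```

A template plays the opposite role: it fixes the field types up front, so that nothing depends on which value happens to be seen first.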

Learn more about importing data →

Query and Filter¶

Flat files come in many shapes, sizes, and formats, making them difficult to work with. Datasets are fully indexed, meaning they can be queried and filtered in real time. The built-in query APIs make it easy to build complex, nested queries in your code or in the web UI. You can even run field-specific statistical and term-based aggregations (known as "facets") on top of any query without downloading any data, which greatly simplifies scripts and apps built with SolveBio.
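As a rough illustration of what a filter followed by a term-based facet computes, the sketch below runs both steps over a small in-memory list. On SolveBio this work happens on the platform through the query APIs, so no records need to be downloaded. The helper names and the sample records here are hypothetical and are not part of the SolveBio client.

```python
from collections import Counter
from typing import Any, Dict, List


def filter_records(records: List[Dict[str, Any]], **equals: Any) -> List[Dict[str, Any]]:
    """Keep the records whose fields equal every given value."""
    return [r for r in records if all(r.get(k) == v for k, v in equals.items())]


def term_facet(records: List[Dict[str, Any]], field: str) -> Counter:
    """Count how often each distinct value of `field` occurs."""
    return Counter(r[field] for r in records if field in r)


# Hypothetical variant records.
records = [
    {"gene": "BRCA1", "significance": "pathogenic"},
    {"gene": "BRCA1", "significance": "benign"},
    {"gene": "TP53", "significance": "pathogenic"},
    {"gene": "BRCA1", "significance": "pathogenic"},
]

brca1 = filter_records(records, gene="BRCA1")
facet = term_facet(brca1, "significance")
```

The facet is computed on the filtered subset, which mirrors running an aggregation "on top of" a query.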

Learn more about querying, filtering, and aggregating data →

Learn more about using Beacons for entity-specific queries →

Portability¶

SolveBio datasets are designed for data portability, making it easy to get data into or out of the platform. Using the dataset export system, you can export full or partial datasets in JSON, CSV, or Excel format.

Learn more about exporting datasets →

Data Migrations¶

In addition to importing and exporting data to/from flat files, it is also possible to migrate data between datasets without the data ever leaving the SolveBio platform. This is incredibly useful when copying or transforming the contents of datasets. When combined with expressions, migrations are a powerful and reproducible way to process, annotate, and analyze datasets of any size.

Learn more about transforming data with migrations →

Storage Classes¶

Every dataset has a storage class that defines how it is stored and replicated within the SolveBio system. Certain storage classes are more performant and resilient, and these should be assigned to essential and mission-critical datasets. Setting a dataset's storage class to "Archive" puts the dataset in the archived state. Archived datasets can still be seen in vaults and through search, but cannot be queried (except through Global Beacon search). Archived datasets do not consume any active storage space.

Automatic archiving enabled in January 2022

Since January 2022, datasets with the "Standard-IA" storage class are automatically archived after 90 days of inactivity. To prevent this for specific datasets, set their storage class to "Essential".

The following storage classes are supported:

Storage Class Name	Description	Lifecycle
`Standard`	Default storage class for new datasets. This storage class has extra redundancy to improve performance and resiliency. Designed for 99.99% availability over a given year (under 1 hour of unexpected downtime).	Transitions to Standard-IA after 30 days of inactivity.
`Standard-IA`	Storage class for datasets which are infrequently accessed (IA). Designed for 99.9% availability over a given year (under 10 hours of unexpected downtime).	Automatically archived after 90 days of inactivity.
`Essential`	Same as Standard. Useful for datasets critical to pipelines and applications.	No transition (not automatically archived).
`Temporary`	Same as Standard-IA. Useful for test datasets, scratch data or disposable workflows.	Automatically archived after 48 hours of inactivity.
`Performance`	Same as Essential but with extreme redundancy for high-performance parallel queries.	No transition (not automatically archived).
`Archive`	Storage class for dormant datasets which remain in the vault, but cannot be queried. They can be restored at any time.	No transition.
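The lifecycle column can be read as a small transition table. The sketch below encodes it in Python and also checks the downtime figures implied by the two availability targets. The names `LIFECYCLES`, `class_after_inactivity`, and `max_downtime_hours` are hypothetical and not part of any SolveBio API. Treating a transition as due as soon as the inactivity period has fully elapsed is an assumption, and the function applies at most one transition.

```python
from typing import Dict, NamedTuple, Optional


class Lifecycle(NamedTuple):
    inactivity_hours: int  # inactivity period before the transition
    next_class: str        # storage class after the transition


# Transitions copied from the table above; None means "no transition".
LIFECYCLES: Dict[str, Optional[Lifecycle]] = {
    "Standard": Lifecycle(30 * 24, "Standard-IA"),
    "Standard-IA": Lifecycle(90 * 24, "Archive"),
    "Essential": None,
    "Temporary": Lifecycle(48, "Archive"),
    "Performance": None,
    "Archive": None,
}


def class_after_inactivity(storage_class: str, inactive_hours: float) -> str:
    """Apply at most one lifecycle transition for the given inactivity."""
    rule = LIFECYCLES[storage_class]
    # Assumption: the transition is due once the full period has elapsed.
    if rule is not None and inactive_hours >= rule.inactivity_hours:
        return rule.next_class
    return storage_class


def max_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    hours_per_year = 365 * 24
    return (1 - availability_percent / 100) * hours_per_year


standard_budget = max_downtime_hours(99.99)  # roughly 0.88 hours
ia_budget = max_downtime_hours(99.9)         # roughly 8.76 hours
```

The two budgets agree with the "under 1 hour" and "under 10 hours" figures quoted in the table.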

Vault default storage classes

A default storage class can be set on any vault via the vault settings page. Any new dataset created within that vault is created with the default storage class.

Version Control¶

All dataset operations run as asynchronous tasks and are logged to an activity feed (visible only to those with read access). Modifications made to datasets are represented by individual "dataset commits", similar to commits in source code version control systems. Dataset commits can be cancelled and (in some cases) rolled back if necessary. The dataset commit system functions as a basic version control and audit trail mechanism suitable for many workflows.

Naming Conventions¶

Dataset names can contain any character except the forward slash (/). We recommend using folders and naming conventions to differentiate similar datasets. Dataset names must be unique within a folder. Moving a dataset into a folder that already contains a dataset with the same name causes the moved dataset's name to be auto-incremented (e.g. dataset, dataset-1, dataset-2, ...).
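The naming rules can be sketched as a small function that rejects forward slashes and appends the first free numeric suffix on a collision. This is a local illustration only: `resolve_name` is a hypothetical helper, and the exact way the platform chooses the suffix is an assumption based on the dataset, dataset-1, dataset-2 pattern described here.

```python
from typing import Set


def resolve_name(name: str, existing: Set[str]) -> str:
    """Return a name that is unique within a folder.

    `existing` holds the names already present in the target folder.
    """
    if "/" in name:
        raise ValueError("dataset names cannot contain '/'")
    if name not in existing:
        return name
    # Assumption: try -1, -2, ... until an unused name is found.
    counter = 1
    while f"{name}-{counter}" in existing:
        counter += 1
    return f"{name}-{counter}"


first = resolve_name("dataset", set())
second = resolve_name("dataset", {"dataset"})
third = resolve_name("dataset", {"dataset", "dataset-1"})
```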

Versioning¶

SolveBio recommends following Semantic Versioning guidelines for reference datasets that are updated periodically. SolveBio uses this convention when naming folders that contain public reference datasets, but the same convention can be applied to the datasets themselves. An example folder name is 0.0.1-2014-01-01.

Versioning for public datasets is performed by placing the datasets into folders named for that version. A version consists of up to four parts, and is of the form MAJOR.MINOR.PATCH-RELEASE:

MAJOR denotes backwards-incompatible changes to dataset schemas, or substantial new releases
MINOR denotes significant backwards-compatible changes (e.g. addition of a field)
PATCH denotes backwards-compatible changes (e.g. minor bug fix)
RELEASE contains release metadata (usually a date, but may also correspond to the data source's release number).
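A version folder name of this form can be split into its parts with a short parser. The sketch below assumes that MAJOR, MINOR, and PATCH are always present and that RELEASE is optional, which is one reading of "up to four parts". `DatasetVersion` and `parse_version` are hypothetical names, and apart from 0.0.1-2014-01-01 the folder names are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class DatasetVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    release: Optional[str]  # e.g. a date or the source's release number


# RELEASE may itself contain hyphens (as in a date), so everything after
# the first hyphen is captured as a single group.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-(.+))?$")


def parse_version(folder_name: str) -> DatasetVersion:
    """Parse a MAJOR.MINOR.PATCH-RELEASE folder name."""
    match = _VERSION_RE.match(folder_name)
    if match is None:
        raise ValueError(f"not a version folder name: {folder_name!r}")
    major, minor, patch, release = match.groups()
    return DatasetVersion(int(major), int(minor), int(patch), release)


example = parse_version("0.0.1-2014-01-01")

# Hypothetical folder names; pick the newest by the numeric parts only.
folders = ["1.0.0-2015-06-01", "1.2.0-2016-01-15", "0.9.3-2014-11-30"]
latest = max(folders, key=lambda name: parse_version(name)[:3])
```

Comparing on the first three numeric parts avoids ordering by RELEASE, which may be a date in one folder and a release number in another.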

Genome Builds¶

Where data is available for more than one genome build, SolveBio sometimes maintains a separate public dataset for each build. The names of these datasets are suffixed with -GRCh37 or -GRCh38, depending on the genome build. Some datasets may contain coordinates for multiple genome builds.
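When working with such names in a script, the build suffix can be separated from the base name with a small helper. This is a sketch under stated assumptions: `split_genome_build` is a hypothetical function, and the dataset name "Variants" is made up for illustration.

```python
from typing import Optional, Tuple

GENOME_BUILDS = ("GRCh37", "GRCh38")


def split_genome_build(name: str) -> Tuple[str, Optional[str]]:
    """Split a dataset name into (base name, genome build or None)."""
    for build in GENOME_BUILDS:
        suffix = f"-{build}"
        if name.endswith(suffix):
            return name[: -len(suffix)], build
    # No recognized suffix: the dataset is not tied to a single build.
    return name, None


with_build = split_genome_build("Variants-GRCh37")
without_build = split_genome_build("Variants")
```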

Last updated 2022-12-07.
