Data Sharing with INSDC

As part of our effort to enable global scientific collaboration and facilitate international public health response, we share open sequences with the INSDC databases (ENA, DDBJ, NCBI) and ingest public sequences from the INSDC. Read more about how your sequences are shared here and here.

This means that when you browse for a pathogen on Pathoplexus, you should also see all other publicly available sequences of that pathogen (provided they satisfy our sequence alignment requirements).

We download sequences from the INSDC using the NCBI Datasets Virus Data Package. We then map INSDC metadata fields to Loculus metadata fields before uploading the sequences to Loculus. To give users access to as much data as possible, we do not enforce required metadata fields on data ingested from the INSDC. However, we do require that the sequence alignment is of acceptable quality. The quality score follows the standard for each pathogen, as defined in the Nextclade datasets that we use.

For multi-segmented organisms, the INSDC often does not offer data in which segments have been grouped by isolate. To retain as much information as possible from the samples, we group segments into samples based on their isolate and other metadata fields. By default, all metadata fields must be the same across segments for us to group them as one sample, with the exception of segment-specific metadata fields. These fields are either alignment-related (length, totalSnps, totalInsertedNucs, totalDeletedNucs, totalUnknownNucs, totalAmbiguousNucs, totalFrameShifts, frameShifts, completeness) or related to the INSDC accession for that specific segment (ncbiUpdateDate, insdcAccessionBase, insdcAccessionFull, insdcVersion).
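
The grouping rule described above can be illustrated with a minimal Python sketch. It is not the actual ingest code: the record layout (one dictionary per segment), the "segment" key, and the function names are assumptions made for illustration. Only the list of segment-specific fields is taken from the text.

```python
from collections import defaultdict

# Segment-specific fields listed in the text; these may differ between segments.
SEGMENT_SPECIFIC_FIELDS = frozenset({
    "length", "totalSnps", "totalInsertedNucs", "totalDeletedNucs",
    "totalUnknownNucs", "totalAmbiguousNucs", "totalFrameShifts",
    "frameShifts", "completeness",
    "ncbiUpdateDate", "insdcAccessionBase", "insdcAccessionFull", "insdcVersion",
})


def shared_key(record):
    """Hashable key built from every field that must match across segments.

    The "segment" key is a hypothetical field naming the segment of a record.
    """
    return tuple(sorted(
        (name, value) for name, value in record.items()
        if name not in SEGMENT_SPECIFIC_FIELDS and name != "segment"
    ))


def group_segments(records):
    """Group per-segment records into samples with identical shared metadata."""
    groups_by_key = defaultdict(list)
    for record in records:
        candidates = groups_by_key[shared_key(record)]
        for group in candidates:
            # A sample holds at most one record per segment.
            if record["segment"] not in group:
                group[record["segment"]] = record
                break
        else:
            candidates.append({record["segment"]: record})
    return [group for groups in groups_by_key.values() for group in groups]


# Made-up records: two segments from one isolate, one segment from another.
records = [
    {"segment": "L", "specimenCollectorSampleId": "isolate-1",
     "geoLocCountry": "Kenya", "length": 12000},
    {"segment": "M", "specimenCollectorSampleId": "isolate-1",
     "geoLocCountry": "Kenya", "length": 5000},
    {"segment": "S", "specimenCollectorSampleId": "isolate-2",
     "geoLocCountry": "Kenya", "length": 1700},
]
samples = group_segments(records)
```

In this sketch the first two records end up in one sample because they differ only in length, which is segment-specific. A record whose country or isolate differed would start a new sample instead.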

Mapping of NCBI Virus Metadata Fields to Loculus Metadata Fields

| NCBI Virus Field Name | Loculus Field Name |
| --- | --- |
| Accession | insdcAccessionFull (also produces insdcAccessionBase, insdcVersion) |
| BioProjects | bioprojectAccession |
| BioSample accession | biosampleAccession |
| Geographic Location | geoLocCountry, geoLocAdmin1 |
| Geographic Region | geoLocAdmin2 |
| Host Common Name | hostNameCommon |
| Host Infraspecific Names Sex | hostGender |
| Host Name | hostNameScientific |
| Host Taxonomic ID | hostTaxonId |
| Is Lab Host | isLabHost |
| Isolate Collection date | sampleCollectionDate |
| Isolate Lineage | specimenCollectorSampleId |
| Purpose of Sampling | purposeOfSampling |
| Release date | ncbiReleaseDate |
| SRA Accessions | sraRunAccession |
| Source database | ncbiSourceDb |
| Submitter Affiliation | authorAffiliations |
| Submitter Names | authors |
| Update date | ncbiUpdateDate |
| Virus Name | ncbiVirusName |
| Virus Taxonomic ID | ncbiVirusTaxId |
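
As a rough illustration of how such a mapping could be applied, the following Python sketch renames the one-to-one fields from the table and derives the base accession and version from the full accession. It makes several assumptions: that an input record is a dictionary keyed by the NCBI Virus field names, that the version is the part of the accession after the first dot, and that the names FIELD_MAP, split_accession and map_record are hypothetical. Geographic Location is left out because the rule for splitting it into geoLocCountry and geoLocAdmin1 is not stated here.

```python
# One-to-one rows from the mapping table (NCBI Virus name -> Loculus name).
FIELD_MAP = {
    "BioProjects": "bioprojectAccession",
    "BioSample accession": "biosampleAccession",
    "Geographic Region": "geoLocAdmin2",
    "Host Common Name": "hostNameCommon",
    "Host Infraspecific Names Sex": "hostGender",
    "Host Name": "hostNameScientific",
    "Host Taxonomic ID": "hostTaxonId",
    "Is Lab Host": "isLabHost",
    "Isolate Collection date": "sampleCollectionDate",
    "Isolate Lineage": "specimenCollectorSampleId",
    "Purpose of Sampling": "purposeOfSampling",
    "Release date": "ncbiReleaseDate",
    "SRA Accessions": "sraRunAccession",
    "Source database": "ncbiSourceDb",
    "Submitter Affiliation": "authorAffiliations",
    "Submitter Names": "authors",
    "Update date": "ncbiUpdateDate",
    "Virus Name": "ncbiVirusName",
    "Virus Taxonomic ID": "ncbiVirusTaxId",
}


def split_accession(accession):
    """Derive the three accession fields, assuming a 'BASE.VERSION' format."""
    base, _, version = accession.partition(".")
    return {
        "insdcAccessionFull": accession,
        "insdcAccessionBase": base,
        "insdcVersion": version,  # kept as a string in this sketch
    }


def map_record(ncbi_record):
    """Translate one NCBI Virus record into Loculus field names."""
    loculus_record = {}
    for ncbi_name, value in ncbi_record.items():
        if ncbi_name == "Accession":
            loculus_record.update(split_accession(value))
        elif ncbi_name in FIELD_MAP:
            loculus_record[FIELD_MAP[ncbi_name]] = value
        # Fields without a one-to-one mapping are skipped here.
    return loculus_record


# Example with made-up values in the usual accession format.
mapped = map_record({
    "Accession": "AB123456.1",
    "Host Name": "Homo sapiens",
    "Geographic Location": "Kenya: Nairobi",
})
```

With these inputs, mapped contains insdcAccessionFull, insdcAccessionBase, insdcVersion and hostNameScientific. The Geographic Location value is dropped because this sketch does not attempt to split it.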