Metadata format

Here you can find a brief description and example of each field you can include in the metadata file uploaded with your samples. Most of these fields are currently optional, but we do have some required fields.

Some fields, like dates, countries and authors, will be standardized so that all entries in the database are easy to process. At the moment, these are the only fields we standardize, and we will keep users updated if that changes in the future.

Please format dates as YYYY-MM-DD. To submit sequences, the full date (including day) is required. If you have older sequences for which the full date isn’t available, or legal reasons relating to releasing the full date, please get in touch via submission@pathoplexus.org.
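
As a rough illustration of the YYYY-MM-DD rule above, the following Python sketch checks whether a value is a complete, zero-padded calendar date. The function name `is_full_date` is hypothetical and is not part of any Pathoplexus tooling; it only shows one way to pre-check a metadata file before upload.

```python
from datetime import datetime


def is_full_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Wrong layout, missing day, or an impossible date such as 2020-02-30.
        return False
    # strptime also accepts unpadded months/days such as "2020-3-5", so
    # compare against the canonical zero-padded ISO form.
    return parsed.date().isoformat() == value
```

For example, `is_full_date("2020-03-15")` is accepted, while `"2020-3-15"`, `"2020-03"` and `"2020-02-30"` are rejected.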

Please format authors as a single string in which authors are separated by semi-colons and a comma separates the last name from the first names/initials, e.g. last name, first name;. The last name(s) is mandatory, and only ASCII alphabetical characters A-Z are allowed. For example: Smith, Anna; Perez, Tom J.; Xu, X.L.; or Xu,; if the first name is unknown.
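
The author format described above can be pre-checked with a short Python sketch. It is not the validation used by the database. It assumes that last names may contain ASCII letters and single spaces (for multiple last names), and that first names/initials may also contain periods and spaces, as in the examples "Tom J." and "X.L.". The name `check_authors` is hypothetical.

```python
import re

# Assumption: last names are ASCII letters, optionally several words.
LAST_NAME = re.compile(r"[A-Za-z]+( [A-Za-z]+)*")
# Assumption: first names/initials may be empty and may contain periods/spaces.
FIRST_NAME = re.compile(r"[A-Za-z. ]*")


def check_authors(authors: str) -> list[str]:
    """Return the entries that do not look like 'last name, first name'."""
    problems = []
    for entry in authors.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # empty piece after a trailing semi-colon
        last, comma, first = entry.partition(",")
        if (
            not comma
            or not LAST_NAME.fullmatch(last.strip())
            or not FIRST_NAME.fullmatch(first.strip())
        ):
            problems.append(entry)
    return problems
```

An empty list means every entry matched the assumed pattern; any returned entries are the ones to fix by hand.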

You can download all metadata fields and their descriptions here: metadata_fields_descriptions.csv

Required Fields:

| Field name | Description | Example |
| --- | --- | --- |
| submissionId | Your sequence identifier; should match the FASTA file header - this is used to link the metadata to the FASTA sequence | GJP123 |
| sampleCollectionDate | The date on which the sample was collected. Please format YYYY-MM-DD - use XX if unknown, ex: 2020-03-XX or 20XX-XX-XX, and provide at least year | 2020-03-15 |
| geoLocCountry | The country from which the sample was collected. | Canada |
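
A minimal Python sketch of how the required fields could be checked before upload, and how `submissionId` values could be compared with FASTA headers. It assumes each metadata row has already been read into a dictionary keyed by field name, and that the identifier in a FASTA header is the text after `>` up to the first whitespace. Both are assumptions for illustration, and the function names are hypothetical.

```python
REQUIRED_FIELDS = ("submissionId", "sampleCollectionDate", "geoLocCountry")


def missing_required(row: dict) -> list[str]:
    """Return the required field names that are absent or blank in one row."""
    return [name for name in REQUIRED_FIELDS if not (row.get(name) or "").strip()]


def fasta_ids(fasta_text: str) -> set[str]:
    """Collect header IDs: the text after '>' up to the first whitespace."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def ids_without_sequence(rows: list[dict], fasta_text: str) -> list[str]:
    """Return submissionId values that have no matching FASTA header."""
    known = fasta_ids(fasta_text)
    return sorted(
        row["submissionId"]
        for row in rows
        if row.get("submissionId") and row["submissionId"] not in known
    )
```

Running `missing_required` on every row and `ids_without_sequence` on the whole file gives a quick list of rows to correct before submitting.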

Optional Fields:

Desired Fields:

| Field name | Description | Example |
| --- | --- | --- |
| geoLocAdmin1 | A local administrative region from which the sample was collected (ex: Province, State, Canton) | Ontario |
| geoLocAdmin2 | A local administrative region from which the sample was collected (ex: county or municipality) | City of Toronto |
| geoLocCity | The city from which the sample was collected. | Toronto |
| specimenCollectorSampleId | If there is another sample ID attached to the sequence, it can be recorded here (nothing identifiable!) | SWG_1001-1B |
| authors | List of authors who should be listed on the sample. Authors should be separated by semi-colons. Each author’s name should be in the format last name, first name;. Last name(s) is mandatory, and only ASCII alphabetical characters A-Z are allowed. | Smith, Anna; Perez, Tom J.; Xu,; |
| authorAffiliations | List of author affiliations in the same order as authors, one entry per author | EVE Group, MPI Department, Swiss TPH, Basel, Switzerland; Anderson Group, IEB Department, School of Biology, University of Edinburgh, Edinburgh, Scotland |
| cultureId | An ID useful for linking to a culture | Plate7-1124 |
| sampleReceivedDate | The date on which the sample was received by the laboratory. | 2020-03-20 |
| collectionDevice | The instrument or container used to collect the sample e.g. swab. | Swab |
| collectionMethod | The process used to collect the sample e.g. phlebotomy, necropsy. | Bronchoalveolar lavage (BAL) |
| sequencingDate | The date the sample was sequenced. | 2021-04-26 |
| sequencingInstrument | The make and model of the sequencing instrument/platform used. | Oxford Nanopore MinION, Illumina NGS Platforms |
| sequencingProtocol | The protocol/program used to generate the sequence. | Genomes were generated through amplicon sequencing of 1200 bp amplicons with Freed schema primers. Ligation sequencing kit SQK-LSK109 (Oxford Nanopore Technologies) was used for library preparation. |
| depthOfCoverage | The average number of reads representing a given nucleotide in the reconstructed sequence. | 400 |
| hostNameScientific | The scientific name of the host from which the sample was collected. | Homo sapiens |
| hostTaxonId | Taxon ID for the host | 9606 |
| versionComment | Reason for revising sequences or other general comments concerning a specific version | Fixed an issue in previous version where low-coverage nucleotides were erroneously filled with reference sequence |

| Field name | Description | Example |
| --- | --- | --- |
| geoLocAdmin1 | A local administrative region from which the sample was collected (ex: Province, State, Canton) | Ontario |
| geoLocAdmin2 | A local administrative region from which the sample was collected (ex: county or municipality) | City of Toronto |
| geoLocCity | The city from which the sample was collected. | Toronto |
| geoLocSite | The name of a specific geographical location e.g. Credit River (rather than river). | Credit River |
| geoLocLatitude | Geo-coordinate latitude in decimal degree (WGS84) format, i.e. values in range -90 to 90, where positive values are north of the Equator. | -34.603722 (latitude of Buenos Aires) |
| geoLocLongitude | Geo-coordinate longitude in decimal degree (WGS84) format, i.e. values in range -180 to 180, where positive values are east of the Prime Meridian. | -58.381592 (longitude of Buenos Aires) |
| specimenCollectorSampleId | If there is another sample ID attached to the sequence, it can be recorded here (nothing identifiable!) | SWG_1001-1B |
| cultureId | An ID useful for linking to a culture | Plate7-1124 |
| isLabHost | If a laboratory host (e.g. cultured cell line) was used to propagate the sample. | True |
| cellLine | Population of cells derived from a single cell or group of cells, which are cultured in the laboratory under controlled conditions | |
| passageNumber | The number of times a cell culture has been subcultured or transferred from one vessel to another. | 12 |
| passageMethod | The techniques used to subculture cells | Enzymatic Detachment |
| sampleReceivedDate | The date on which the sample was received by the laboratory. | 2020-03-20 |
| sampleType | Method of sampling. | Nasopharyngeal Swabs |
| purposeOfSampling | The reason that the sample was collected. | Diagnostic testing |
| presamplingActivity | The activities or variables introduced upstream of sample collection that may affect the sample collected. | Antimicrobial pre-treatment |
| anatomicalMaterial | A substance obtained from an anatomical part of an organism e.g. tissue, blood, saliva, fluid (Cerebrospinal (CSF), Pericardial, Pleural, Vaginal, Amniotic) | Blood |
| anatomicalPart | The anatomical part of the organism the sample was taken from e.g. oropharynx. | Nasopharynx (NP) |
| bodyProduct | A substance excreted/secreted from an organism e.g. feces, urine, sweat. | Feces |
| environmentalMaterial | A substance obtained from the natural or man-made environment e.g. soil, water, sewage, door handle, bed handrail, face mask. | Face mask |
| environmentalSite | An environmental location may describe a site in the natural or built environment e.g. hospital, wet market, bat cave. | Hospital |
| collectionDevice | The instrument or container used to collect the sample e.g. swab. | Swab |
| collectionMethod | The process used to collect the sample e.g. phlebotomy, necropsy. | Bronchoalveolar lavage (BAL) |
| foodProduct | A material consumed and digested for nutritional value or enjoyment. Include animal feed. | Bone meal, Chicken breast |
| foodProductProperties | Any characteristic of the food product pertaining to its state, processing, a label claim, or implications for consumers. | Food (chopped) |
| specimenProcessing | The processing applied to samples post-collection, prior to further testing, characterization, or isolation procedures. | Samples pooled |
| specimenProcessingDetails | Detailed information regarding the processing applied to a sample during or after receiving the sample. | 25 swabs were pooled and further prepared as a single sample during library prep. |
| experimentalSpecimenRoleType | The type of role that the sample represents in the experiment. | Positive experimental control |
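
The ranges given for geoLocLatitude and geoLocLongitude in the table above can be checked with a few lines of Python. This is only a sketch under the assumption that both values arrive as strings holding bare decimal numbers (without the explanatory text shown in the Example column); the function name `valid_coordinates` is hypothetical.

```python
def valid_coordinates(latitude: str, longitude: str) -> bool:
    """Return True if both values parse as decimal degrees within range."""
    try:
        lat, lon = float(latitude), float(longitude)
    except ValueError:
        return False
    # Latitude: -90 to 90 (positive is north of the Equator).
    # Longitude: -180 to 180 (positive is east of the Prime Meridian).
    # NaN fails both comparisons, so it is rejected as well.
    return -90 <= lat <= 90 and -180 <= lon <= 180
```

For the Buenos Aires example, `valid_coordinates("-34.603722", "-58.381592")` passes, while swapped or out-of-range values such as a latitude of `"91"` do not.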

| Field name | Description | Example |
| --- | --- | --- |
| hostAge | The age of the host at the time of sample collection. | 35 |
| hostAgeBin | The age category of the host at the time of sampling. | 30-39 |
| hostGender | The gender of the host. | Female |
| hostOriginCountry | The country of origin of the host. | Switzerland |
| hostDisease | The name of the disease experienced by the host. | mastitis, gastroenteritis |
| signsAndSymptoms | A perceived change in function or sensation (loss, disturbance or appearance) indicative of a disease, reported by a patient. | cough; fever |
| hostHealthState | Health status of the host at the time of sample collection. | Asymptomatic |
| hostHealthOutcome | Disease outcome in the host. | Recovered |
| travelHistory | Travel history in last six months. | Canada, Vancouver; USA, Seattle; Italy, Milan |
| exposureEvent | Event leading to exposure. | Mass Gathering |
| hostRole | The role of the host in relation to the exposure setting. | Patient |
| exposureSetting | The setting leading to exposure. | Healthcare Setting |
| exposureDetails | Additional host exposure information. | Host role - Other: Bus Driver |
| previousInfectionDisease | The name of the disease previously experienced by the host. | COVID-19 |
| previousInfectionOrganism | The name of the pathogen causing the disease previously experienced by the host. | Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) |
| hostVaccinationStatus | The vaccination status of the host (fully vaccinated, partially vaccinated, or not vaccinated) - regarding current sampled pathogen. | Fully Vaccinated |
| purposeOfSequencing | The reason that the sample was sequenced. | Baseline surveillance (random sampling) |
| diagnosticMeasurementMethod | Type of diagnostic test performed | RT-PCR test |
| diagnosticTargetPresence | Boolean value, if the diagnostic target was present. | True |
| diagnosticTargetGeneName | The name of the gene used in a diagnostic RT-PCR test | E gene (orf4) |
| diagnosticMeasurementValue | Value of the diagnostic test. Ex: The Ct value result from a diagnostic SARS-CoV-2 RT-PCR test | 21 |
| diagnosticMeasurementUnit | The units of the diagnostic test performed | Cycle threshold (CT) |
| hostNameScientific | The scientific name of the host from which the sample was collected. | Homo sapiens |
| hostNameCommon | The common name of the host from which the sample was collected. | Human |
| hostTaxonId | Taxon ID for the host | 9606 |

| Field name | Description | Example |
| --- | --- | --- |
| sequencingDate | The date the sample was sequenced. | 2021-04-26 |
| ampliconPcrPrimerScheme | The specifications of the primers (primer sequences, binding positions, fragment size generated etc) used to generate the amplicons to be sequenced. | https://github.com/joshquick/artic-ncov2019/blob/master/primer_schemes/nCoV-2019/V3/nCoV-2019.tsv |
| ampliconSize | The length of the amplicon generated by PCR amplification. | 300bp |
| sequencingInstrument | The model of the sequencing instrument used. | Oxford Nanopore MinION |
| sequencingProtocol | The protocol used to generate the sequence. | Genomes were generated through amplicon sequencing of 1200 bp amplicons with Freed schema primers. Ligation sequencing kit SQK-LSK109 (Oxford Nanopore Technologies) was used for library preparation. |
| sequencingAssayType | The overarching sequencing methodology that was used to determine the sequence of a biomaterial. | whole genome sequencing assay |
| sequencedByOrganization | The organization responsible for sequencing the sample. | Public Health Agency of Canada (PHAC) |
| sequencedByContactName | The name or title of the contact responsible for follow-up regarding the sequence. | Enterics Lab Manager |
| sequencedByContactEmail | The email address of the contact responsible for follow-up regarding the sequence. | enterics@lab.ca |
| rawSequenceDataProcessingMethod | The method used for raw data processing such as removing barcodes, adapter trimming, filtering etc. | Porechop 0.2.3 |
| dehostingMethod | The method used to remove host reads from the pathogen sequence. | Nanostripper 1.2.3 |
| referenceGenomeAccession | A persistent, unique identifier of a genome database entry. | NC_045512.2 |
| consensusSequenceSoftwareName | The name of software used to generate the consensus sequence. | Ivar |
| consensusSequenceSoftwareVersion | The version of the software used to generate the consensus sequence. | 1.3 |
| depthOfCoverage | The average number of reads representing a given nucleotide in the reconstructed sequence. | 400 |
| breadthOfCoverage | The percentage of the reference genome covered by the sequenced data, at a prescribed depth (depthOfCoverage) | 100 |
| qualityControlMethodName | The name of the method used to assess whether a sequence passed a predetermined quality control threshold. | ncov-tools |
| qualityControlMethodVersion | The version number of the method used to assess whether a sequence passed a predetermined quality control threshold. | 1.2.3 |
| qualityControlDetermination | The determination of a quality control assessment. | sequence failed quality control |
| qualityControlIssues | The reason contributing to, or causing, a low quality determination in a quality control assessment. | low average genome coverage |
| qualityControlDetails | The details surrounding a low quality determination in a quality control assessment. | CT value of 39. Low viral load. Low DNA concentration after amplification. |

Other Optional Fields

| Field name | Description | Example |
| --- | --- | --- |
| authors | List of authors who should be listed on the sample, separated by semi-colons, each in the format last name, first name | Hodcroft, Emma B.; McDougal, John |
| authorAffiliations | List of author affiliations in the same order as authors, one entry per author | EVE Group, MPI Department, Swiss TPH, Basel, Switzerland; Anderson Group, IEB Department, School of Biology, University of Edinburgh, Edinburgh, Scotland |
| versionComment | Reason for revising sequences or other general comments concerning a specific version | Fixed an issue in previous version where low-coverage nucleotides were erroneously filled with reference sequence |