Metadata format
Here you can find a brief description and example of each field you can include in the metadata file uploaded with your samples.
Most of these fields are currently optional, but we do have some required fields.
Some fields, like dates, countries and authors, will be standardized so that all entries in the database are easy to process.
At the moment, these are the only fields we standardize, and we will keep users updated if that changes in the future.
Please format dates as YYYY-MM-DD.
To submit sequences, the full date (including day) is required. If you have older sequences for which the full date isn’t available, or if legal reasons prevent you from releasing the full date, please get in touch via submission@pathoplexus.org.
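As a rough illustration of the YYYY-MM-DD requirement, the following Python sketch checks that a value is a complete, real calendar date before upload. The function name `is_full_date` is hypothetical and is not part of any Pathoplexus tooling; the checks applied by the database itself may differ.

```python
import re
from datetime import datetime

# Exactly four, two and two ASCII digits separated by hyphens.
_FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_full_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not _FULL_DATE.fullmatch(value):
        return False
    try:
        # Rejects impossible dates such as 2020-02-30.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True
```

Running a check like this over the date columns before submission can catch values such as 15/03/2020 or 2020-3-15 early.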
Please format authors as a string in which authors are separated by semi-colons and a comma separates the last name from the first names or initials, in the form last name, first name; for each author. The last name is mandatory, and only ASCII alphabetical characters A-Z are allowed. For example: Smith, Anna; Perez, Tom J.; Xu, X.L.; or Xu,; if the first name is unknown.
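The author format can be sketched in Python as a small parser. This is only an illustration: `parse_authors` is a hypothetical helper, and the set of accepted characters (ASCII letters plus the spaces and periods that appear in the examples) is an assumption rather than the database's actual rule.

```python
import re

# Assumption: names may contain ASCII letters plus spaces and periods,
# as in the examples; the real validator may accept more or less.
_NAME_PART = re.compile(r"[A-Za-z][A-Za-z. ]*")


def parse_authors(authors: str) -> list[tuple[str, str]]:
    """Split an authors string into (last_name, first_names) pairs."""
    parsed = []
    for entry in authors.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate the trailing semi-colon
        if "," not in entry:
            raise ValueError(f"missing comma in author entry: {entry!r}")
        last, first = (part.strip() for part in entry.split(",", 1))
        if not _NAME_PART.fullmatch(last):
            raise ValueError(f"invalid or missing last name: {entry!r}")
        if first and not _NAME_PART.fullmatch(first):
            raise ValueError(f"invalid first name: {entry!r}")
        parsed.append((last, first))
    return parsed
```

For instance, `parse_authors("Xu,;")` yields a single author with an empty first name, while an entry written as `Anna Smith` (no comma) raises a `ValueError`.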
You can download all metadata fields and their descriptions here: metadata_fields_descriptions.csv
Required Fields:
Field name | Description | Example |
---|---|---|
submissionId | Your sequence identifier; should match the FASTA file header - this is used to link the metadata to the FASTA sequence | GJP123 |
sampleCollectionDate | The date on which the sample was collected. Please format as YYYY-MM-DD - use XX if unknown, e.g. 2020-03-XX or 20XX-XX-XX, and provide at least the year | 2020-03-15 |
geoLocCountry | The country from which the sample was collected. | Canada |
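Before uploading, it can help to confirm that every row carries the three required fields. The Python sketch below does this for a metadata file held as text. It assumes a tab-separated file with a header row, which this page does not specify, and `find_missing_required` is a hypothetical name used only for illustration.

```python
import csv
import io

REQUIRED_FIELDS = ("submissionId", "sampleCollectionDate", "geoLocCountry")


def find_missing_required(metadata_text: str, delimiter: str = "\t") -> list[str]:
    """Return human-readable problems for missing required columns or values."""
    # Assumption: the metadata is delimiter-separated text with a header row.
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing_columns = [f for f in REQUIRED_FIELDS if f not in header]
    problems = [f"missing column: {column}" for column in missing_columns]
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in REQUIRED_FIELDS:
            if field in missing_columns:
                continue  # already reported once for the whole file
            if not (row.get(field) or "").strip():
                problems.append(f"row {row_number}: empty {field}")
    return problems
```

An empty list means every row has a value for each required field; the sketch does not check that the values themselves are well formed.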
Optional Fields:
Desired Fields:
Field name | Description | Example |
---|---|---|
geoLocAdmin1 | A local administrative region from which the sample was collected (ex: Province, State, Canton) | Ontario |
geoLocAdmin2 | A local administrative region from which the sample was collected (ex: county or municipality) | City of Toronto |
geoLocCity | The city from which the sample was collected. | Toronto |
specimenCollectorSampleId | If there is another sample ID attached to the sequence, it can be recorded here (nothing identifiable!) | SWG_1001-1B |
authors | List of authors who should be listed on the sample. Authors should be separated by semi-colons. Each author’s name should be in the format last name, first name. The last name is mandatory, and only ASCII alphabetical characters A-Z are allowed. | Smith, Anna; Perez, Tom J.; Xu,; |
authorAffiliations | List of author affiliations in the same order as authors, one entry per author | EVE Group, MPI Department, Swiss TPH, Basel, Switzerland; Anderson Group, IEB Department, School of Biology, University of Edinburgh, Edinburgh, Scotland |
cultureId | An ID useful for linking to a culture | Plate7-1124 |
sampleReceivedDate | The date on which the sample was received by the laboratory. | 2020-03-20 |
collectionDevice | The instrument or container used to collect the sample e.g. swab. | Swab |
collectionMethod | The process used to collect the sample e.g. phlebotomy, necropsy. | Bronchoalveolar lavage (BAL) |
sequencingDate | The date the sample was sequenced. | 2021-04-26 |
sequencingInstrument | The make and model of the sequencing instrument/platform used. | Oxford Nanopore MinION, Illumina NGS Platforms |
sequencingProtocol | The protocol/program used to generate the sequence. | Genomes were generated through amplicon sequencing of 1200 bp amplicons with Freed schema primers. Ligation sequencing kit SQK-LSK109 (Oxford Nanopore Technologies) was used for library preparation. |
depthOfCoverage | The average number of reads representing a given nucleotide in the reconstructed sequence. | 400 |
hostNameScientific | The scientific name of the host from which the sample was collected. | Homo sapiens |
hostTaxonId | Taxon ID for the host | 9606 |
versionComment | Reason for revising sequences or other general comments concerning a specific version | Fixed an issue in previous version where low-coverage nucleotides were erroneously filled with reference sequence |
Field name | Description | Example |
---|---|---|
geoLocAdmin1 | A local administrative region from which the sample was collected (ex: Province, State, Canton) | Ontario |
geoLocAdmin2 | A local administrative region from which the sample was collected (ex: county or municipality) | City of Toronto |
geoLocCity | The city from which the sample was collected. | Toronto |
geoLocSite | The name of a specific geographical location e.g. Credit River (rather than river). | Credit River |
geoLocLatitude | Geo-coordinate latitude in decimal degree (WGS84) format, i.e. values in range -90 to 90, where positive values are north of the Equator. | -34.603722 (latitude of Buenos Aires) |
geoLocLongitude | Geo-coordinate longitude in decimal degree (WGS84) format, i.e. values in range -180 to 180, where positive values are east of the Prime Meridian. | -58.381592 (longitude of Buenos Aires) |
specimenCollectorSampleId | If there is another sample ID attached to the sequence, it can be recorded here (nothing identifiable!) | SWG_1001-1B |
cultureId | An ID useful for linking to a culture | Plate7-1124 |
isLabHost | If a laboratory host (e.g. cultured cell line) was used to propagate the sample. | True |
cellLine | Population of cells derived from a single cell or group of cells, which are cultured in the laboratory under controlled conditions | |
passageNumber | The number of times a cell culture has been subcultured or transferred from one vessel to another. | 12 |
passageMethod | The techniques used to subculture cells | Enzymatic Detachment |
sampleReceivedDate | The date on which the sample was received by the laboratory. | 2020-03-20 |
sampleType | Method of sampling. | Nasopharyngeal Swabs |
purposeOfSampling | The reason that the sample was collected. | Diagnostic testing |
presamplingActivity | The activities or variables introduced upstream of sample collection that may affect the sample collected. | Antimicrobial pre-treatment |
anatomicalMaterial | A substance obtained from an anatomical part of an organism e.g. tissue, blood, saliva, fluid (Cerebrospinal (CSF), Pericardial, Pleural, Vaginal, Amniotic) | Blood |
anatomicalPart | The anatomical part of the organism the sample was taken from e.g. oropharynx. | Nasopharynx (NP) |
bodyProduct | A substance excreted/secreted from an organism e.g. feces, urine, sweat. | Feces |
environmentalMaterial | A substance obtained from the natural or man-made environment e.g. soil, water, sewage, door handle, bed handrail, face mask. | Face mask |
environmentalSite | An environmental location may describe a site in the natural or built environment e.g. hospital, wet market, bat cave. | Hospital |
collectionDevice | The instrument or container used to collect the sample e.g. swab. | Swab |
collectionMethod | The process used to collect the sample e.g. phlebotomy, necropsy. | Bronchoalveolar lavage (BAL) |
foodProduct | A material consumed and digested for nutritional value or enjoyment. Include animal feed. | Bone meal, Chicken breast |
foodProductProperties | Any characteristic of the food product pertaining to its state, processing, a label claim, or implications for consumers. | Food (chopped) |
specimenProcessing | The processing applied to samples post-collection, prior to further testing, characterization, or isolation procedures. | Samples pooled |
specimenProcessingDetails | Detailed information regarding the processing applied to a sample during or after receiving the sample. | 25 swabs were pooled and further prepared as a single sample during library prep. |
experimentalSpecimenRoleType | The type of role that the sample represents in the experiment. | Positive experimental control |
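The ranges given for geoLocLatitude and geoLocLongitude in the table above can be checked with a short Python sketch. The helper names `parse_coordinate` and `parse_lat_lon` are hypothetical, and the sketch assumes coordinates arrive as plain decimal strings.

```python
def parse_coordinate(value: str, limit: float) -> float:
    """Parse a decimal-degree string and check it lies within [-limit, limit]."""
    number = float(value)  # raises ValueError for non-numeric text
    if not -limit <= number <= limit:
        # NaN also ends up here, because comparisons with NaN are False.
        raise ValueError(f"{value!r} is outside the range -{limit} to {limit}")
    return number


def parse_lat_lon(latitude: str, longitude: str) -> tuple[float, float]:
    """Return (latitude, longitude) as floats after range checks."""
    return parse_coordinate(latitude, 90.0), parse_coordinate(longitude, 180.0)
```

A common mistake this catches is swapping the two values: a longitude such as 151.2 passed as the latitude is rejected because it exceeds 90.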
Field name | Description | Example |
---|---|---|
hostAge | The age of the host at the time of sample collection. | 35 |
hostAgeBin | The age category of the host at the time of sampling. | 30-39 |
hostGender | The gender of the host. | Female |
hostOriginCountry | The country of origin of the host. | Switzerland |
hostDisease | The name of the disease experienced by the host. | mastitis, gastroenteritis |
signsAndSymptoms | A perceived change in function or sensation, (loss, disturbance or appearance) indicative of a disease, reported by a patient. | cough; fever |
hostHealthState | Health status of the host at the time of sample collection. | Asymptomatic |
hostHealthOutcome | Disease outcome in the host. | Recovered |
travelHistory | Travel history in last six months. | Canada, Vancouver; USA, Seattle; Italy, Milan |
exposureEvent | Event leading to exposure. | Mass Gathering |
hostRole | The role of the host in relation to the exposure setting. | Patient |
exposureSetting | The setting leading to exposure. | Healthcare Setting |
exposureDetails | Additional host exposure information. | Host role - Other: Bus Driver |
previousInfectionDisease | The name of the disease previously experienced by the host. | COVID-19 |
previousInfectionOrganism | The name of the pathogen causing the disease previously experienced by the host. | Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) |
hostVaccinationStatus | The vaccination status of the host (fully vaccinated, partially vaccinated, or not vaccinated) - regarding current sampled pathogen. | Fully Vaccinated |
purposeOfSequencing | The reason that the sample was sequenced. | Baseline surveillance (random sampling) |
diagnosticMeasurementMethod | Type of diagnostic test performed | RT-PCR test |
diagnosticTargetPresence | Boolean value, if the diagnostic target was present. | True |
diagnosticTargetGeneName | The name of the gene used in a diagnostic RT-PCR test | E gene (orf4) |
diagnosticMeasurementValue | Value of the diagnostic test. Ex: The Ct value result from a diagnostic SARS-CoV-2 RT-PCR test | 21 |
diagnosticMeasurementUnit | The units of the diagnostic test performed | Cycle threshold (CT) |
hostNameScientific | The scientific name of the host from which the sample was collected. | Homo sapiens |
hostNameCommon | The common name of the host from which the sample was collected. | Human |
hostTaxonId | Taxon ID for the host | 9606 |
Field name | Description | Example |
---|---|---|
sequencingDate | The date the sample was sequenced. | 2021-04-26 |
ampliconPcrPrimerScheme | The specifications of the primers (primer sequences, binding positions, fragment size generated etc) used to generate the amplicons to be sequenced. | https://github.com/joshquick/artic-ncov2019/blob/master/primer_schemes/nCoV-2019/V3/nCoV-2019.tsv |
ampliconSize | The length of the amplicon generated by PCR amplification. | 300bp |
sequencingInstrument | The model of the sequencing instrument used. | Oxford Nanopore MinION |
sequencingProtocol | The protocol used to generate the sequence. | Genomes were generated through amplicon sequencing of 1200 bp amplicons with Freed schema primers. Ligation sequencing kit SQK-LSK109 (Oxford Nanopore Technologies) was used for library preparation. |
sequencingAssayType | The overarching sequencing methodology that was used to determine the sequence of a biomaterial. | whole genome sequencing assay |
sequencedByOrganization | The organization responsible for sequencing the sample. | Public Health Agency of Canada (PHAC) |
sequencedByContactName | The name or title of the contact responsible for follow-up regarding the sequence. | Enterics Lab Manager |
sequencedByContactEmail | The email address of the contact responsible for follow-up regarding the sequence. | enterics@lab.ca |
rawSequenceDataProcessingMethod | The method used for raw data processing such as removing barcodes, adapter trimming, filtering etc. | Porechop 0.2.3 |
dehostingMethod | The method used to remove host reads from the pathogen sequence. | Nanostripper 1.2.3 |
referenceGenomeAccession | A persistent, unique identifier of a genome database entry. | NC_045512.2 |
consensusSequenceSoftwareName | The name of software used to generate the consensus sequence. | Ivar |
consensusSequenceSoftwareVersion | The version of the software used to generate the consensus sequence. | 1.3 |
depthOfCoverage | The average number of reads representing a given nucleotide in the reconstructed sequence. | 400 |
breadthOfCoverage | The percentage of the reference genome covered by the sequenced data, at a prescribed depth (depthOfCoverage). | 100 |
qualityControlMethodName | The name of the method used to assess whether a sequence passed a predetermined quality control threshold. | ncov-tools |
qualityControlMethodVersion | The version number of the method used to assess whether a sequence passed a predetermined quality control threshold. | 1.2.3 |
qualityControlDetermination | The determination of a quality control assessment. | sequence failed quality control |
qualityControlIssues | The reason contributing to, or causing, a low quality determination in a quality control assessment. | low average genome coverage |
qualityControlDetails | The details surrounding a low quality determination in a quality control assessment. | CT value of 39. Low viral load. Low DNA concentration after amplification. |
Other Optional Fields
Field name | Description | Example |
---|---|---|
authors | List of authors who should be listed on the sample, separated by semi-colons, each in the format last name, first name | Hodcroft, Emma B.; McDougal, John |
authorAffiliations | List of author affiliations in the same order as authors, one entry per author | EVE Group, MPI Department, Swiss TPH, Basel, Switzerland; Anderson Group, IEB Department, School of Biology, University of Edinburgh, Edinburgh, Scotland |
versionComment | Reason for revising sequences or other general comments concerning a specific version | Fixed an issue in previous version where low-coverage nucleotides were erroneously filled with reference sequence |