Data Submission and Processing

These Terms were updated on 4 Aug 2024 and are the current and valid version.

These Terms constitute the legal agreement between you (a “User”) and Pathoplexus (“we”, “us”, “our”), of Basel-Stadt, Switzerland. They govern what data can be submitted to Pathoplexus, how it is processed, and how it can be changed.

This document is governed by the Pathoplexus Values and should be interpreted in light of them. The Executive Board can modify this document, in line with the purpose and commitments of the Pathoplexus Values, via a 2/3 majority vote of the entire Board. If the Board has 5 members, this is interpreted as 4 of 5 votes in favor.
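
As an illustration of the voting threshold described above, the following minimal sketch computes how many votes a 2/3 majority of the entire Board requires. The function name `votes_required` is hypothetical, and rounding fractional thresholds up is an assumption inferred from the five-member example; this is not an official Pathoplexus tool.

```python
def votes_required(board_size: int) -> int:
    """Smallest whole number of votes that is at least 2/3 of the entire Board."""
    if board_size < 1:
        raise ValueError("board_size must be at least 1")
    # Integer ceiling of 2 * board_size / 3, avoiding floating-point rounding.
    return (2 * board_size + 2) // 3


print(votes_required(5))  # 4, matching the five-member example in the text
```

Because the threshold is counted against the entire Board, absent or abstaining members do not lower the number of votes needed under this reading.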

1. Definitions

As used in these Terms, the following terms shall have the following meanings:

1.1 The term “User” shall mean everyone who accesses the web service (https://pathoplexus.org) in any form.

1.2 The term “Submitter” shall mean those who submit data to Pathoplexus.

1.3 The term “Submitting Group” shall mean a group on whose behalf a Submitter has submitted sequences, and which has control over all sequences submitted on its behalf.

1.4 The term “Curator” shall mean those who have elevated access in order to help detect and correct errors in the data.

2. Data Submission and Access

Submitters can submit sequences to Pathoplexus for supported pathogens with associated, non-sensitive metadata, either with time-limited restrictions on its use (“Restricted-Use Data”) or openly (“Open Data”). Open Data (whether open from submission or becoming open after the period of restriction ends) is also submitted by Pathoplexus to INSDC (via ENA), where it becomes additionally available on all INSDC platforms. Open Data also includes data pulled in from INSDC for supported pathogens to make it available to all Users alongside Restricted-Use Data.

Restricted-Use Data may also be submitted to INSDC soon after submission to Pathoplexus, but is held under Embargo (see ENA data availability policies here and here) until it becomes Open Data; this means it is not visible on any INSDC platform until it becomes Open Data.

Restricted-Use Data is available only within Pathoplexus, and can only be used in accordance with the Data Use Terms. It is critical that all Users familiarize themselves with the Data Use Terms, and in particular requirements for authorship and acknowledgement. Pathoplexus expects ethical use of all data (see Data Use Terms).

Users can specify the time period for which data is “Restricted-Use Data”, up to a maximum of one year, and also manually release data from Restricted Use before the time period expires. When the Restricted Use period has ended, Restricted-Use Data becomes Open Data within this database, and is released from embargo on INSDC (making it visible on INSDC platforms).
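
The timing rule above can be sketched as follows. This is a minimal illustration, not Pathoplexus code: the function names are hypothetical, "one year" is assumed to mean the same calendar day in the following year, and data is assumed to become open on the chosen end date itself.

```python
from datetime import date
from typing import Optional


def latest_release_date(submitted: date) -> date:
    """Latest permitted end of the Restricted-Use period: one year after submission."""
    try:
        return submitted.replace(year=submitted.year + 1)
    except ValueError:
        # 29 February has no counterpart in a non-leap year;
        # falling back to 28 February is an assumption.
        return submitted.replace(year=submitted.year + 1, day=28)


def is_open(submitted: date, restricted_until: date, today: date,
            released_early_on: Optional[date] = None) -> bool:
    """True once the data counts as Open Data under the assumptions above."""
    if restricted_until > latest_release_date(submitted):
        raise ValueError("Restricted-Use period may not exceed one year")
    # A manual release before the period expires opens the data early.
    if released_early_on is not None and released_early_on <= today:
        return True
    return today >= restricted_until
```

For example, data submitted on 4 August 2024 with a six-month restriction would be reported as restricted on 1 December 2024 and as open from 4 February 2025, unless it was released manually before then.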

All data within Pathoplexus must be used ethically, and in accordance with scientific and community etiquette (see Data Use Terms), and accession numbers should always be provided (see Data Use Terms).

3. Data Processing

Pathoplexus processes submitted data to validate, harmonize, and standardize the data and maximize utility. After initial submission and processing, data is available as version 1, and subsequent changes are reflected in incremented versions. By default, the latest version of all data is displayed unless another version is explicitly requested.
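
The versioning behaviour described above can be pictured with a small sketch. The class `VersionedRecord` and its methods are hypothetical names chosen for illustration and do not reflect how Pathoplexus stores data internally.

```python
from typing import Dict, List, Optional


class VersionedRecord:
    """Keeps every version of a record; version numbers start at 1."""

    def __init__(self, initial_data: Dict[str, str]) -> None:
        self._versions: List[Dict[str, str]] = [dict(initial_data)]

    def revise(self, changes: Dict[str, str]) -> int:
        """Store a new version with the changes applied and return its number."""
        new_version = {**self._versions[-1], **changes}
        self._versions.append(new_version)
        return len(self._versions)

    def get(self, version: Optional[int] = None) -> Dict[str, str]:
        """Return the latest version unless a specific one is requested."""
        if version is None:
            return dict(self._versions[-1])
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"no such version: {version}")
        return dict(self._versions[version - 1])
```

Calling `get()` with no argument mirrors the default display of the latest version, while `get(1)` retrieves the state after initial submission and processing.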

To allow Users maximum flexibility in accessing the data that is most useful to them, the only submissions rejected are data lacking the minimum required metadata values and sequences that cannot be identified (via seed-matching) as the specified pathogen.

Data processing includes measures such as:

Sequences:

  • Checks that the sequence matches the virus specified by the user
  • Alignment with the reference genome
  • Translation of genes/coding regions
  • Quantification of mutations in number and type (both nucleotide and AA)
  • Labeling of deletions and insertions
  • Quality control scoring
  • Assignment of clades/lineages (where applicable)

Metadata:

  • All dates are converted to standard ISO format
  • Location information is standardized
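
As a rough illustration of the date standardization step listed above, the sketch below converts a few input styles to ISO 8601 (`YYYY-MM-DD`). The list of accepted formats and the function name `to_iso_date` are assumptions made for this example; the formats Pathoplexus actually accepts may differ.

```python
from datetime import datetime

# Input formats assumed for illustration only. Ambiguous day/month orders
# (such as 04/08/2024) are deliberately left out. Month names are parsed
# according to the current locale (English is assumed here).
ASSUMED_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d %b %Y", "%d %B %Y")


def to_iso_date(raw: str) -> str:
    """Return the date as an ISO 8601 string, or raise ValueError if unrecognized."""
    text = raw.strip()
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next assumed format
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Under these assumptions, `to_iso_date("4 Aug 2024")` yields `2024-08-04`, and an unrecognized string raises an error instead of being guessed at.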

4. Data Fields

You can see a full list of the data fields we support here.

5. Persistence

Pathoplexus is not designed to be the sole final resting place for sequence data. Instead, it is designed to add value to existing public data and to facilitate sharing of additional data. In particular, it provides easy access to datasets via API, allows easy upload of data to public repositories, and provides a temporary, protected holding-place for data allowing data generators time to analyze and publish their work. Thus, Users and Submitters should understand that all submitted sequences will be submitted onward to INSDC databases eventually (after a maximum of one year). This allows data to persist and be reused by others in a sustainable, long-term framework.

6. Curation

Curation is a process to flag and rectify errors in sequences and metadata. This process is performed by Users with elevated access: Curators. You can find a list of our current Curators here. Curation is linked to the process of Revision.

Curators follow a well-described procedure, which can be viewed here, in deciding when to flag errors and when to suggest, and accept, revisions. As also described in the Revision section, this works differently depending on whether data was uploaded directly to Pathoplexus, or was obtained from an INSDC database.

In either case, an error is only considered likely if two Curators independently agree that the evidence presented in our ‘Curation reports’ GitHub repository strongly supports the error.

At this point, the data is flagged as ‘question raised’. This flag will remain until a Curator is satisfied that the issue is resolved.

In the case where the error cannot be corrected (it’s clear there’s an error but unclear what the correct data is), the sequence will only be flagged. In cases where the error cannot be corrected and the error may cause confusion or be misleading, the Curator can propose to Revoke the sequence. If another Curator agrees, the sequence will be Revoked.

7. Revision

By submitting group:

Submitters, and the Submitting Group, can correct a submission at any time by submitting a Revision. Revisions by Submitters can edit either the metadata or the sequence itself, and allow incorrect data to be fixed, data to be improved (perhaps a better-quality sequence or changing a date from year-only to year-month-day), or data to be added (perhaps symptom information has become available). Revisions of any kind increment the version of a sequence, and the previous states of the data are preserved as earlier versions, which can be accessed at any time.

You can find out how to submit a Revision here.

By curator:

Curators can also submit Revisions, which work slightly differently depending on whether the data was submitted directly to Pathoplexus or whether it comes from an INSDC database.

For data submitted directly to Pathoplexus, Curators can flag suspected errors (if two Curators agree an error is likely to exist), and propose Revisions. However, only Users from the Group that submitted these sequences can accept any proposed Revisions. Directly submitted sequences are never Revised by Pathoplexus without the Submitting Group accepting the Revision.

For data directly uploaded to Pathoplexus:

Curators will not accept Revisions for directly uploaded data; only the Submitter and Submitting Group can accept such Revisions. A Curator will flag the sequence(s) in question and prepare the Revision, and the Submitting Group will be notified. They can accept or reject the Revision. Note that if the Submitting Group rejects the Revision, the data will remain ‘flagged’ unless they submit another Revision that a Curator feels corrects the error.

For data received from an INSDC database:

Once one Curator prepares the Revision, a separate Curator can accept it, resolving the error and removing the flag.
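
The two acceptance rules for Curator-prepared Revisions can be summarized in a short sketch. The `Actor` type, the `"direct"` and `"insdc"` source labels, and the function name are all hypothetical and exist only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    name: str
    is_curator: bool = False
    group: str = ""  # name of the actor's Submitting Group, if any


def may_accept_revision(source: str, submitting_group: str,
                        prepared_by: Actor, acceptor: Actor) -> bool:
    """Decide whether `acceptor` may accept a Revision prepared by a Curator."""
    if source == "direct":
        # Directly uploaded data: only the Submitting Group can accept.
        return acceptor.group != "" and acceptor.group == submitting_group
    if source == "insdc":
        # INSDC data: a separate Curator (not the preparer) can accept.
        return acceptor.is_curator and acceptor.name != prepared_by.name
    raise ValueError(f"unknown data source: {source!r}")
```

In this sketch a Curator can never accept a Revision to directly uploaded data unless they also belong to the Submitting Group, and the Curator who prepared a Revision to INSDC data cannot accept it themselves.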

More detail about how data can be edited and curated is available here. See also the section on Curation, above.

Anyone can bring a suspected error to the attention of Curators by detailing the nature of the possible error and the supporting evidence in a GitHub issue in ‘Curation Reports’ (preferred) or in an email to submissions@pathoplexus.org.

8. Revocation and Deletion

If a sequence is incorrect, and cannot immediately be corrected (especially if the error may be misleading or cause confusion), the sequence should be Revoked.

8.1 Revocation

Groups that have submitted data may Revoke a sequence at any time, if there are errors which cannot be corrected. We ask this be done only in situations where the sequence cannot be corrected via a Revision (see above) immediately.

Some (non-exhaustive) examples of when a sequence should be Revoked:

  • The sequence is an artificial (lab-based) recombination or contamination that could or will lead to misinformation or confusion (for example, “Deltacron” during the SARS-CoV-2 pandemic)
  • Similarly, the sequence is of poor quality in any manner that may lead to misinformation or confusion

In all of these cases, a Revision could be submitted instead, if it is available immediately and will rectify the problem. However, if a new version of the sequence or metadata is not yet available (or will never become available), please Revoke the sequence.

Note that Revocation does not delete the sequence from the database - it is still visible and can be downloaded. However, it is clearly marked as Revoked, is by default excluded from searches, and will not be included in any default data downloads. It is always possible to revert the Revocation by submitting a Revision later, with the correct information.
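
The effect of Revocation on searches can be illustrated with a minimal sketch. The `Record` type, its fields, the placeholder accession strings, and the `include_revoked` parameter are hypothetical and are not part of the Pathoplexus API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    accession: str
    revoked: bool = False


def search(records: List[Record], include_revoked: bool = False) -> List[Record]:
    """Return matching records; Revoked ones are excluded unless asked for."""
    return [r for r in records if include_revoked or not r.revoked]


# Placeholder accessions for illustration only.
records = [Record("EXAMPLE_1"), Record("EXAMPLE_2", revoked=True)]
default_hits = search(records)                      # Revoked record is left out
all_hits = search(records, include_revoked=True)    # Revoked record is still retrievable
```

The Revoked record is never removed from `records`; it is only filtered out of the default view, which mirrors the distinction between Revocation and Deletion.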

You can read how to revoke a sequence here.

8.2 Deletion

Generally, Pathoplexus does not permit the permanent removal of sequences, since this is not transparent and causes confusion when a record cannot be traced. Permanent removal can therefore only be done through the intervention of a Pathoplexus administrator and under very specific circumstances.

Some cases where this may be permitted:

  • The sequences contain clinically sensitive or identifiable information
  • The sequences contain human reads
  • The sequences were uploaded without permission of those who generated them

If you wish to request that sequences be considered for permanent removal, please contact submissions@pathoplexus.org.