Data Use Terms

Summaries of our Data Use Terms for Open Data and Restricted-Use Data are also available. Note that only the full Data Use Terms are used to interpret and arbitrate use.



These Terms were updated on 4 Aug 2024 and are the current and valid version.

These Terms constitute the legal agreement between you (a “User”) and Pathoplexus (“we”, “us”, “our”) (of Pathoplexus, Basel-Stadt, Switzerland), governing how Restricted-Use Data can be used and shared onward, as well as how it must be attributed when used.

This document is governed by the Pathoplexus Values and should be interpreted in light of the Pathoplexus Values. The Executive Board can modify and make changes to this document, in line with the purpose and commitments of the Pathoplexus Values, via 2/3 majority vote of the entire Board. If the Board has 5 members, this is interpreted as 4/5 votes in favor.
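The voting threshold above can be written as a small calculation. The sketch below is illustrative only. It assumes that a "2/3 majority of the entire Board" means the smallest whole number of votes that is at least two thirds of all Board members, which reproduces the 4-of-5 case stated above. The function name `votes_required` is hypothetical, and the rounding for other Board sizes is an assumption rather than part of these Terms.

```python
def votes_required(board_size: int) -> int:
    """Smallest whole number of votes that is at least 2/3 of the entire Board.

    Assumption: fractional thresholds are rounded up, consistent with the
    stated example of 4 votes for a 5-member Board.
    """
    if board_size < 1:
        raise ValueError("board_size must be a positive integer")
    # Integer ceiling of (2 * board_size) / 3, avoiding floating-point rounding.
    return (2 * board_size + 2) // 3


print(votes_required(5))  # 4, matching the example in the Terms
```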

Pathoplexus aims to encourage data sharing by providing multiple options for sequence sharing, with submitters choosing whether to immediately provide data under open conditions, or to stipulate that its use is restricted for a period of time to mitigate concerns of “scooping” and use without appropriate acknowledgement. Pathoplexus believes that ethical use of all data is critical. This data is only available due to the hard work of data generators. Rapid data sharing, which allows the rapid assessment of pathogen characteristics and dynamics, will only be possible if trust is maintained that shared data will be used and acknowledged appropriately. By engaging in considerate, ethical, and fair use of the data shared with the community, Pathoplexus users can play an active role in building a community that fosters more data sharing.

1. Definitions:

As used in these Data Use Terms, the following terms shall have the following meanings:

1.1 The term “User” shall mean everyone who accesses the web service (https://pathoplexus.org) in any form

1.2 The term “Submitter” shall mean those who submit data to Pathoplexus

1.3 The term “Submitting Group” shall mean a group that a Submitter has submitted sequences on behalf of, and has control over all sequences submitted on behalf of that Submitting Group

1.4 The term “Curator” shall mean those who have elevated access in order to help detect and correct errors in the data

1.5 The term “SeqSet” shall mean collections of sequences indicated by their accession numbers, which provide unique identifiers that can be used to reference that set of sequences.

2. Data Submission and Protection

Users can submit sequences for supported pathogens with associated, non-sensitive metadata (see the metadata fields we accept, and what data is sensitive) either without use restrictions (herein called “Open Data” for brevity) or with use restrictions (“Restricted-Use Data”) - both types of data together are referred to here as “Pathoplexus Data” or the “Data”.

The Data are made available to all Users, at no cost, on condition of acceptance of the Data Use Terms (upon accessing the Data in any form, Users agree to the Data Use Terms). The Data Use Terms are further explicitly linked to within the metadata.

Restricted-Use Data can only be used within the conditions of the Data Use Terms (see below). Open Data is not subject to these terms, but should still be used ethically: data generators should be acknowledged and collaborations should be sought in some circumstances (see Open Data below). Users are required to read the Data Use Terms in detail and note applicable restrictions and expectations of notification, offers of collaboration, and acknowledgement that should be followed.

3. Onward Submission to INSDC

Open Data submitted to Pathoplexus is immediately submitted to INSDC on behalf of the original Submitters, where it becomes additionally available on all INSDC platforms.

Restricted-Use Data is displayed in Pathoplexus with a clear indication that its use is restricted, and is held under embargo at INSDC so that it is not accessible through the INSDC databases until the expiration of the Restricted-Use period of up to one year. Immediate embargoed submission to INSDC allows Submitters to obtain accession numbers to be used in publication, whilst keeping their data subject to the Pathoplexus terms of use. After the Restricted-Use period ends, Restricted-Use Data becomes Open Data.

4. Terms

4.1 Open Data

Pathoplexus expects correct acknowledgement and crediting of Open Data, via SeqSets and DOIs at a minimum, and through collaboration and co-authorship with sequence Submitters where appropriate. In particular, all efforts should be made to avoid “scooping” others’ work. This may include situations where you use extensive data from a country or region without involving and including any authors from that region or publishing analyses that may preclude the original data generators from publishing on their own data.

Publications and preprints using any form of data from Pathoplexus must provide the accession numbers for the sequences used. Pathoplexus strongly encourages using SeqSets (see section 4.4). It is also recommended to additionally list the INSDC accessions, to ensure data can be easily traced on both platforms.

4.1.1 Third Party Data Sharing

Data from Pathoplexus can be shared onward but the Data Use Terms must be clearly communicated, and any data distributed should include the Data Use Terms columns (dataUseTerms, dataUseTermsRestrictedUntil and dataUseTermsUrl) intact in the metadata. If displayed on a website or in another database, each sequence should have a direct link to the original sequence page on Pathoplexus, and display the Pathoplexus accession or link to an INSDC database and display the INSDC accession, if available on INSDC.

4.2 Restricted Data Use

Restricted-Use Data can only be used under the Data Use Terms outlined herein.

Data can remain ‘Restricted-Use’ for up to one year after submission. Submitters and Submitting Groups can set a shorter restricted-use period at submission, or choose to end, or shorten, the Restricted-Use period at any time. After this period ends, the data becomes Open Data.
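The one-year limit can be checked with simple date arithmetic. The sketch below is illustrative only. It assumes that "one year after submission" means the same calendar date in the following year, and that a submission on 29 February is capped at 28 February of the next year. Both readings are assumptions, and the function names are hypothetical.

```python
from datetime import date


def latest_restricted_until(submitted: date) -> date:
    """Latest date on which data submitted on `submitted` may still be restricted."""
    try:
        return submitted.replace(year=submitted.year + 1)
    except ValueError:
        # 29 February has no counterpart in the following year.
        # Assumption: fall back to 28 February.
        return submitted.replace(year=submitted.year + 1, day=28)


def is_allowed_restricted_until(submitted: date, restricted_until: date) -> bool:
    """True if the chosen end date is within one year of submission."""
    return submitted <= restricted_until <= latest_restricted_until(submitted)


print(latest_restricted_until(date(2024, 8, 4)))                         # 2025-08-04
print(is_allowed_restricted_until(date(2024, 8, 4), date(2025, 8, 5)))  # False
```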

4.2.1 Unpublished and Un-Preprinted Work

Data from Pathoplexus can be used for unpublished and un-preprinted work, such as, but not limited to: graphical representations, blog posts, social media, public health and governmental reports, and web programs and applications (in this case, check whether 4.2.4 Third Party Data Sharing, below, applies).

  • Restricted-Use Data must be acknowledged by providing a list of accession numbers used (or using a SeqSet which will contain all this data) and linking back to Pathoplexus so that users can view the original data and submitters. If all sequences used are from one institution, this institution and the leading author(s) should be credited explicitly. If more than 5 sequences are used, it’s highly recommended to create a SeqSet to easily link to all data used.

4.2.2 Publications and Preprints

In scientific publications and preprints, Restricted-Use data can often only be used with explicit permission of the Submitting Group. It is vital that you read the conditions of use below.

Pathoplexus believes it is important that people who generate the sequences have the opportunity to complete and publish the analysis they intend with the context and expertise they possess. Thus, we provide guidelines to prevent Users from “telling the submitters’ story” when using others’ data. To aid in this interpretation we have created the categories of “Focal Set” and “Background Set”.

Users utilizing data from Pathoplexus in publications and preprints must create a SeqSet (see section 4.4) containing the Pathoplexus accession numbers and generate a DOI, dividing their sequence and metadata into two groups:

  • The “Focal Set” is defined as sequences that are key to the analysis and resulting conclusions of the manuscript - these sequences cannot be left out or replaced without changing the analysis significantly.
  • The “Background Set” is defined as sequences that provide context or background, but could be replaced or (partially) removed, without changing the results. In the circumstance where all available data is being used, the entire set may be equivalent to a “Background Set” - see Using All Data, below.

For a more detailed description of how to divide data into “Focal” and “Background Sets”, see 4.2.3 ‘Deciding’, below.

The requirements for using Restricted-Use data in a Focal or Background Set differ - please read carefully.

Requirements for use of Restricted-Use Data in publications and preprints:

  • If Restricted-Use Data is part of a Focal Set, you must satisfy the three points below:
  1. One or more Submitters must be included as an author on the manuscript (as agreed with the Submitting Group) (“Authorship”), or the authors of the manuscript should include, as a supplemental document to the manuscript, explicit written permission of the Submitting Group, giving permission to publish and turning down Authorship (“Authorship Waiver”).

  2. The authors must create a SeqSet containing all Focal Set data from Pathoplexus, generate a DOI, and cite this DOI in the manuscript as a reference.

  3. Add the following statement to the Acknowledgement section of the manuscript: “We confirm that we have adhered to the Data Use Terms of Pathoplexus.”

  • If Restricted-Use Data is part of a Background Set:
    • Very carefully consider whether your Restricted-Use Data can, and should, be considered part of a Background Set by reading the 4.2.3 ‘Deciding’ guidelines below.
    • Create a SeqSet containing all Background Set data from Pathoplexus, generate a DOI, and be sure to cite this DOI in the manuscript as a reference.

4.2.3 Deciding if Restricted-Use Data is part of a Focal or Background Set

When deciding whether data should be part of the Focal Set, the intent of a Focal Set should be interpreted broadly: this is data without which the work would not be possible. Data that is part of a Focal Set is thus data that is critical to the analysis - it could not be removed or replaced with a randomly selected similar set without changing the results significantly and thus should be acknowledged appropriately.

Focal Set - if you answer yes to any of these, this data should be part of your “Focal Set”:

  • If I removed these sequences from my analysis (but kept the Background Set), would the analysis be impossible and/or the result be completely different?
  • If I removed these sequences and replaced them with sequences from another country/time-period would my analysis not exist or be completely different?
  • Do I mention these accession IDs or sequence names in my paper explicitly?
  • Are these sequences from a specific geographic area that my analysis focuses on (either on purpose or by chance - ex: the location where a new variant/strain originated or where a mutation of interest is present, which is the subject of the manuscript)?

The intent of data in a Background Set is to provide context to the Focal Set. Including data in the Background Set implies that this data could be replaced with other data to a reasonable degree without impacting the analysis. Any data for which this is not true should be part of the Focal Set.

Background Set - if you answer yes to all of these, the data may be acceptable to include in a “Background Set”:

  • If I removed these sequences from my analysis but replaced them with other, randomly selected sequences from the entire dataset, would my analysis still be meaningful and the result the same (even if weaker or less clear)?
  • If using sequences from multiple countries/regions, could I remove any one country/region from this analysis without affecting my main results? (If this is true for some geographical areas but not others, the ones for which it is not true should be part of your “Focal Set”)

If unsure whether data should be in the Focal or Background Set, it is best practice to consider the data part of a Focal Set.
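The checklist logic in this section can be summarized in a short sketch. It is illustrative only and does not replace the judgment the questions call for. It assumes each question is answered `True` (yes), `False` (no), or `None` (unsure); the function name `classify_sequences` and this encoding are hypothetical.

```python
from typing import Optional, Sequence


def classify_sequences(
    focal_answers: Sequence[Optional[bool]],
    background_answers: Sequence[Optional[bool]],
) -> str:
    """Return "Focal" or "Background" from answers to the questions above."""
    # A "yes" to any Focal Set question puts the data in the Focal Set.
    if any(answer is True for answer in focal_answers):
        return "Focal"
    # If unsure about any question, treat the data as part of a Focal Set.
    if any(answer is None for answer in list(focal_answers) + list(background_answers)):
        return "Focal"
    # Background is only acceptable if every Background Set question is a "yes".
    if background_answers and all(answer is True for answer in background_answers):
        return "Background"
    # Data that cannot be replaced without impacting the analysis is Focal.
    return "Focal"


print(classify_sequences([False, False, False, False], [True, True]))  # Background
print(classify_sequences([False, False, True, False], [True, True]))   # Focal
```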

4.2.4 Third Party Data Sharing

Restricted-Use Data from Pathoplexus can be shared onward but the Data Use Terms must be clearly communicated, and any data distributed must include the Data Use Terms columns (dataUseTerms, dataUseTermsRestrictedUntil and dataUseTermsUrl) intact in the metadata. If displayed on a website or in another database, each sequence must have a direct link to the original sequence page on Pathoplexus and display the Pathoplexus accession.
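Before redistributing a metadata file, one way to confirm that the Data Use Terms columns are still present is a header check such as the sketch below. It is illustrative only. It assumes the metadata is a delimited text file (tab-separated by default) whose first row holds the column names. The function name, the sample accession, the sample values, and the example URL are all hypothetical.

```python
import csv
import io
from typing import List

# Column names as listed in these Terms.
REQUIRED_COLUMNS = ("dataUseTerms", "dataUseTermsRestrictedUntil", "dataUseTermsUrl")


def missing_data_use_columns(metadata_text: str, delimiter: str = "\t") -> List[str]:
    """Return the required Data Use Terms columns absent from the header row."""
    reader = csv.reader(io.StringIO(metadata_text), delimiter=delimiter)
    header = next(reader, [])  # empty input yields an empty header
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Hypothetical sample rows, for illustration only.
intact = (
    "accession\tdataUseTerms\tdataUseTermsRestrictedUntil\tdataUseTermsUrl\n"
    "EXAMPLE_1\tRESTRICTED\t2025-08-04\thttps://example.org/terms\n"
)
stripped = "accession\tdataUseTerms\nEXAMPLE_1\tRESTRICTED\n"

print(missing_data_use_columns(intact))    # []
print(missing_data_use_columns(stripped))  # ['dataUseTermsRestrictedUntil', 'dataUseTermsUrl']
```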

The focus of Pathoplexus is on open availability of data. If Users share Restricted-Use Pathoplexus data onward as a database or as part of a database it must be under the same circumstances as it has been provided to User: without access restrictions. An exception is made for private use for small groups or labs with up to 200 users. If Users wish to use Pathoplexus data in an access-restricted database with more than 200 users, Users must contact Pathoplexus to request permission (help@pathoplexus.org). We generally support onward sharing for collaborative use (such as access within an institution for their employees, or sharing within a collaboration for joint downstream analysis) and encourage you to reach out. Requests will be considered and decided by the Executive Board.

4.3 Using All Data

In some cases there are a large number of data submitters that have contributed to the database and in these cases, analyses and applications that use all data from the database can consider this data as “Background Set” without explicit involvement of the submitters. In other cases, where only a small number of groups have contributed, it would not be ethical to use all data without contacting and involving the submitting groups.

If a large number of submitters have contributed to the data, and you can answer yes to one of the scenarios below, it may be appropriate to treat your entire dataset as a “Background Set” and acknowledge it appropriately.

Examples of what would be considered appropriate use of All Data:

  • Using every available sequence in order to look at overarching sequence properties (ex: frequencies of a mutation) that are found globally or across continents
  • Using every sequence available in order to speak to only global/international trends with no specific focus on a country or region, especially one where a virus/variant/strain/clade originated or where a mutation of interest is only found

4.4 How to Create a SeqSet

Please see our how-to here for how to create a SeqSet.

If you are producing a publication or preprint, you MUST cite the DOI in the References section of your manuscript (as if it were another paper or resource) so that it is documented by CrossRef and the paper can be linked to the sequences used.

5. How to Check a SeqSet

(Particularly for Editors, Reviewers, & Readers)

We appreciate and value the efforts of publishers to encourage and promote the ethical behavior of publishing scientists, by checking for any restrictions on data they are publishing, as well as the geographical distribution of the data - particularly for the focal set.

Anyone can easily check a SeqSet by following the pathoplexus.org SeqSet link or the DOI link provided to you.

Editors and reviewers are always encouraged to reach out to authors in order to better understand their choice in sequences, authors, and focal/background split. Having written guidelines guiding data use is relatively new, and many authors may genuinely misunderstand or misinterpret them, and be happy to rectify issues that fall foul of the Data Use Terms.

Things to consider for publishers, reviewers, and others checking DOI sets:

For a Focal Set:

  • Are any of the sequences Restricted-Use Data? If so, the authors should have at least one Submitter of these sequences in their authorship list, or a letter giving permission to publish and waiving authorship. Editors and reviewers are encouraged to consider whether any submitting authors included are a fair representation, considering the analysis and the data used (does this seem like a genuine collaboration effort, or simply inclusion of a ‘token author’?) and to seek clarification before acceptance if helpful.
  • What is the geographic distribution of the data, and does it match the geographic distribution of the authorship of the manuscript? If all authors are from or based in a region that is separate from where the majority of the focal data stems, is the data being used ethically? Editors and reviewers are encouraged to seek clarification in cases where no submitting, generating, or locally-involved persons have been included in the analysis.
  • Does the Focal Set fully encompass the sequences that are the focal point of the manuscript? Do the total number of sequences in the Focal Set and their geographic distribution match what the authors reference in the manuscript? Is there any possibility that their key analysis relies on sequences not included in the Focal Set, and if so, why have they been excluded from the Focal Set?

For Background Set:

  • Are any of the sequences Restricted-Use Data? If so, they should only be included as background sequences if they pass the questions above. If not, they should be included in the Focal Set and treated appropriately.
  • How is the Background dataset used? Is the Background data truly playing only a contextualizing role, or have the authors focused on any sequences or countries within the Background Set? If so, these sequences need to be part of the Focal Set.
  • What is the geographic and time distribution of the data? Is the Background Set more broadly spread than the Focal Set? If the Background Set doesn’t seem to substantially differ from the Focal Set, should it be part of the Focal Set?