App C H3Africa EGA Submission Guidelines 13092013

European Genome-phenome Archive (EGA)
General Information
1) What is the EGA?
The EGA [1] is a centralised repository that permanently stores, and provides user access to, all types of
personally identifiable genetic and phenotypic biomedical research data. The EGA archives data from
study participants who have expressly consented to the release of their data only for specific research
use and only to bona fide researchers. Very strict policies and protocols are in place which determine
how the data is managed, stored and distributed. Only members of the EGA are allowed to process
data on site, and all data is encrypted with the encryption keys / codes.
2) EGA System and Security
The EGA is a secure computing facility that uses a shared EBI setup for data submissions and has a
petabyte scale archive for original data files and for data access via the website. All data outside this
secure EGA site is encrypted. The EBI web team has implemented strict web protocols, including
phpBB for https logins, and also maintains the file archive and data submission components. The EGA
itself is not visible to any EBI network machine; it is visible only to EGA staff.
The EGA data archive is modular to provide high security and performance for large archived
datasets. Individuals, their samples and phenotypes are stored in separate databases. Experimental
data are also stored in separate databases depending on the type of data. The EGA short read
archive is only for raw sequence data. Processed data types must be submitted separately to the
EGA and are stored in dedicated databases.
3) How is data accessed from the EGA?
The EGA implements a distributed access granting policy where the decision to grant access to the
data resides with the relevant consortium data access committee (DAC). In order to access the data
in the EGA, a scientist has to apply in writing to the DAC. The DAC, not the EGA, is primarily
responsible for making all data access decisions; the EGA merely facilitates the secure archiving and
hosting of the data. The DAC is composed of members from the organisation that produced the data
and not EGA personnel.
The Data Access Agreement (DAA) is a contract made directly between the relevant DAC and the
applicant wishing to access the data. The EGA will only provide access to the data once the DAC has
informed the EGA that the application was successful.
[1] https://www.ebi.ac.uk/ega/node/66
The EGA provides encryption keys / codes physically and offline once access to a particular
dataset / study has been granted by approval from a DAC. These can be used to obtain web access.
The EGA provides support for consortium members to access the data before publication by allowing
access to only the consortium data and study websites by means of authorised secure logins. All
consortium members have access to the consortium web page which is used to collate data on
consortium projects in the EGA. The pre-publication phase for this is usually between 6 and 12 months.
4) Consortium Specific Website
The EGA creates a separate website for each consortium that deposits data within its system. The
website provides information about the consortium, a link back to that consortium’s website and a
list of archived studies by that consortium.
Each study is assigned a stable identifier that can be referred to in publications. Authorised data
access requires a user to login and files designated for distribution will be encrypted and moved to a
dedicated disk area outside the EGA. Each file is made available to only those users that are granted
access to the data by the DAC. Hyperlinks to download data are available in the secure login website
to approved users only and a script will verify the access to the data before downloads begin. For
large datasets web download is not optimal and the EGA enables temporary FTP access via an Aspera
account. The Aspera account requires a password and can only be used by one person at a time.
5) Data Submission
Before the EGA can accept any data submissions the following policy documentation is required to
be submitted with the data: Data access agreement, data access application form and a Policy
statement (examples are provided in submission pack when contacting the EGA for submitting data).
EGA only accepts de-identified data with a DAC approved plan. Accepted data types include
manufacturer specific raw data formats from the array based and new sequencing platforms. The
processed data which may include genotype, structural variants or any summary level statistical
analyses from the original study authors are stored in databases. The EGA will also accept and
provide access to any phenotype data associated with the data. Prior to submission, the submitters
must contact the EGA.
EGA Data Types and Formats for Submission
1) What type of data does the EGA Accept?
The EGA accepts [2]:
a) Sequence data (raw/unaligned and analysis/aligned), including RNASeq, resequencing, epigenomics, transcriptomics and other sequence based assays [3].
b) Array based data (genotypes, SNP, Expression) and their associated phenotypic information [4].
c) Analysis based submissions.
[2] https://www.ebi.ac.uk/ega/submission
[3] https://www.ebi.ac.uk/ega/submission/sequence
2) What File formats does EGA Accept?
a) Sequence Based Submissions
i. Sequences
BAM format is the preferred EGA option. All BAM files submitted must be able to be read with
SAMtools and Picard. Currently BAM files must be de-multiplexed prior to submission (plans to
accept submission of BAM files with reads from multiple samples are in the pipeline).
Table 1: Summary of information on supported file formats for EGA Sequence based data
Supported:
• BAM – most preferred
• SFF for 454 data
• Convert Illumina Scarf Format to FastQ
• Convert Illumina qseq to FastQ format before submission
• FastQ format
• PacBio HDF5 format
• Pooled data must be de-multiplexed and barcoded before submission
• Complete Genomics data (with some caveats – see below)
• Reference alignments – BAM format
• Sequence Variations – VCF format
Not Supported:
• Colour spaced BAM
• Files not de-multiplexed before submission
• Signal data for Illumina GA / HiSeq and SOLiD platforms
• Scarf Format from Illumina (not greatly supported)
• Illumina qseq format
Note:
• Colour spaced BAM files are not supported. Data files have to be de-multiplexed before
submission so that each run is submitted with files containing data for a single sample only.
• Signal data is no longer accepted for Illumina GA/HiSeq and SOLiD platforms but continues to
be supported for the 454 platform. The minimum submission level for EGA is base/colour
calls with quality scores.
• As BAM is near optimal in terms of compression, BAM files should be submitted without
additional compression.
• For 454 data the EGA accepts SFF files, which are likewise already compressed and should be
submitted without additional compression.
• Illumina Scarf Format – the EGA will accept it but discourages it, as these submissions cannot
be processed or made available in other formats.
[4] https://www.ebi.ac.uk/ega/submission/array_based
• The EGA requires submitters to convert Illumina Scarf Format data to Fastq prior to submission,
and to convert the Scarf format log-odds qualities to Phred qualities when preparing the FastQ
files for submission.
• Illumina qseq format – accepted but discouraged, for the same reasons as the Scarf format
above. Submitters should convert from qseq to Fastq format.
• Data submissions in PacBio HDF5 format are accepted.
• “Complete Genomics data should be submitted using the intact Complete Genomics
directory tree structure containing the ASM, LIB and MAP subfolders. Each individual
genome should be submitted as a single Run object associated with a single Experiment and
Sample object. Please note that the reads and mappings in the MAP directory should be
included in the submission.”
• If submitting Fastq files, the compression algorithm to use is gzip or bzip2.
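The log-odds to Phred conversion mentioned in the notes above can be sketched as follows. This is a minimal illustration, not an EGA tool: it assumes the input qualities are Solexa log-odds scores encoded with an ASCII offset of 64, and it writes Phred scores with an offset of 33 (the function names are hypothetical). The formula used is the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).

```python
import math


def solexa_to_phred(q_solexa: int) -> int:
    """Convert one Solexa log-odds quality to the nearest integer Phred quality."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def solexa_line_to_phred33(qual: str) -> str:
    """Re-encode a quality string from Solexa+64 (assumed input) to Phred+33."""
    return "".join(chr(solexa_to_phred(ord(c) - 64) + 33) for c in qual)


# Low scores differ noticeably between the two scales; high scores converge.
print(solexa_to_phred(-5), solexa_to_phred(0), solexa_to_phred(40))  # prints: 1 3 40
```

The two scales only diverge for low-quality bases, which is why the conversion matters mostly for data from early pipelines.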
For FastQ format, primary sequence data submissions of single and paired reads are accepted
as Fastq files that meet the following requirements:
• Quality scores must be in Phred scale. For example, quality scores from early Solexa
pipelines must be converted to use this scale. Both ASCII and space-delimited decimal
encoding of quality scores are supported. The EGA will automatically detect the Phred quality
offset of either 33 or 64.
• No technical reads (adapters, linkers, barcodes) are allowed.
• Single reads must be submitted using a single Fastq file and can be submitted with or
without read names.
• Paired reads must be split and submitted using either one or two Fastq files. The read names
must have a suffix identifying the first and second read from the pair, for example '/1' and
'/2' (regular expression for the reads "^(.*)([\\.|:|/|_])([12])$").
• The first line for each read must start with '@'.
• The base calls and quality scores must be separated by a line starting with '+'.
• The Fastq files must be compressed using gzip or bzip2.
Example of Fastq file containing single reads:
@read_name
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
Example of Fastq file containing paired reads:
@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
where <cycle> indicates the cycle number that starts the second read. The Fastq files should be
compressed using gzip.
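The structural Fastq requirements listed above can be checked before upload with a short script. The sketch below is illustrative only and is not part of any EGA tooling: the function names are hypothetical, it assumes plain 4-line records with ASCII-encoded qualities (so the quality string has the same length as the sequence), and it reuses the read-name regular expression quoted above, written as a Python raw string.

```python
import gzip
import io
import re

# Read-name pattern quoted in the guidelines, as a Python raw string.
PAIR_SUFFIX = re.compile(r"^(.*)([\.|:|/|_])([12])$")


def check_fastq(handle, paired=False):
    """Return a list of problems found in 4-line Fastq records read from `handle`."""
    problems = []
    lines = [line.rstrip("\n") for line in handle]
    if len(lines) % 4 != 0:
        problems.append("file does not contain a whole number of 4-line records")
    usable = len(lines) - len(lines) % 4
    for i in range(0, usable, 4):
        name, seq, sep, qual = lines[i:i + 4]
        rec = i // 4 + 1
        if not name.startswith("@"):
            problems.append(f"record {rec}: first line does not start with '@'")
        if not sep.startswith("+"):
            problems.append(f"record {rec}: separator line does not start with '+'")
        if len(seq) != len(qual):  # only meaningful for ASCII-encoded qualities
            problems.append(f"record {rec}: {len(seq)} bases but {len(qual)} quality scores")
        if paired and not PAIR_SUFFIX.match(name[1:]):
            problems.append(f"record {rec}: read name lacks a '/1' or '/2' style suffix")
    return problems


def check_fastq_gz(path, paired=False):
    """Run the same checks on a gzip-compressed Fastq file."""
    with gzip.open(path, "rt") as handle:
        return check_fastq(handle, paired=paired)


sample = "@read_name/1\nGATT\n+\n!''*\n@read_name/2\nGATT\n+\n!''*\n"
print(check_fastq(io.StringIO(sample), paired=True))  # prints: []
```

The checks cover only the layout rules stated in the list; they do not detect technical reads or verify the quality scale.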
ii. Secondary analysis formats
The EGA supports 2 types of analyses, reference alignments in BAM format and sequence variations
in VCF format.
b) Array Based Submissions
The EGA supports submission of processed data from all types of array based technologies such as
genotype, gene expression, methylation etc. The EGA also archives any associated phenotypic data.
The EGA online user manual gives no further information on the files accepted for array based
submissions beyond what is stated above.
c) Analysis Based Submissions
These include genotypes (summary or aggregate), structural variants (VCF), expression and
phenotype [5]. Accepted data types include raw data manufacturer specific formats from array based
and NGS platforms and processed data such as genotypes, structural variants, aligned reads.
Key Stages of EGA Submission
1) Sequence Based Submissions to the EGA
There are 4 key stages for the submission of sequence based data according to the EGA manual
(https://www.ebi.ac.uk/ega/submission/manual):
a) Contact EGA
Contact the EGA and provide details as to your data files / types and anticipated size.
b) Receive submission pack
The submission pack includes login details for your account, documents detailing the key
stages of submission, and a policy statement template for completion and return.
c) Upload data
Upload data files into your data upload account using EGA Webin Data uploader which
automatically encrypts and creates md5sum check values to make sure all your data is
uploaded correctly.
d) Document
Provide details of study, samples, experiments, policy and datasets. Metadata must be
produced at this stage, either by using the Webin EGA tool or by creating and submitting
your own XMLs.
[5] https://www.ebi.ac.uk/ega/submission/phenotypes
If preparing XMLs for submission there are 2 key stages. The first stage for sequence data involves
the preparation of a submission, study, sample, experiment, DAC, Policy and Run XML files for
annotation (7 XML files in total). The second stage requires a submission (different to the one from
stage 1) and a dataset XML file (2 XML files in total). More details on XML files are provided in
Appendix A.
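As a quick pre-submission check, the two XML sets described above can be written down as a checklist and compared against the files in a working directory. This is only a sketch: the function name is hypothetical, and it assumes each file name starts with its XML type, in the style of the Submission_example.xml naming used in Appendix A.

```python
# XML types required at each stage, as listed in the paragraph above.
STAGE_1 = ["submission", "study", "sample", "experiment", "dac", "policy", "run"]
STAGE_2 = ["submission", "dataset"]


def missing_xml(stage_types, filenames):
    """Return the XML types in `stage_types` that have no matching file name.

    Assumes (hypothetically) that each file name begins with its XML type,
    e.g. 'Study_example.xml'.
    """
    lowered = [name.lower() for name in filenames]
    return [t for t in stage_types
            if not any(n.startswith(t) and n.endswith(".xml") for n in lowered)]


files = ["Submission_example.xml", "Study_example.xml", "Sample_example.xml",
         "Experiment_example.xml", "Run_example.xml", "DAC_example.xml"]
print(missing_xml(STAGE_1, files))  # prints: ['policy']
```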
Example XML files that can be used / modified for submission are provided on the EGA user manual
website: https://www.ebi.ac.uk/ega/submission/manual#Example_XML_Submission.
The XML files are uploaded to a Test XML user account, provided within the submission pack, which
acts as a testing area in which to validate the XMLs. Once validated, the finished XMLs are
uploaded to a Production XML user account (also provided within the submission pack).
Note:
• There is a Java uploader that needs to be installed if using the dropbox to upload files instead
of Aspera or FTP.
• Data files affiliated to a submission are uploaded into private submission drop boxes using
FTP or Aspera protocols, which are provided as part of the submission procedure.
• Submitters are encouraged to use the EGA uploader tool
(https://www.ebi.ac.uk/ega/submission/tools), which encrypts, generates md5sums and
uploads your files to your submission dropbox.
• Data files may also be uploaded manually using FTP or Aspera, but submitters *must*
ensure that all data files are encrypted with GnuPG and md5sum values are provided in the
format required. Please note, the EGA uploader tool may also be used to encrypt your files
and generate md5sum values *without* uploading your files.
• All submissions, except summary/aggregate level, require policy documentation. This
consists of 'Policy statements', 'Data Access Agreement (DAA)' and 'Data access application
form'.
• The EGA also requires the submission of associated metadata, which includes contact details of
the submitter and sample descriptions. For sequence submissions the EGA uses XMLs and/or
Webin, and for Array based submissions it uses the Array based format document to collect
this information.
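For manual uploads, the md5sum values mentioned in the notes above have to be computed for each file both before and after encryption. The following sketch shows one way to compute them with the Python standard library. The function names and dictionary keys are hypothetical, the encryption step itself (GnuPG) is assumed to have been done separately, and the exact file format in which the EGA expects the values is not specified here, so the sketch only returns the hex digests.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_pair(plain_path, encrypted_path):
    """Collect the checksums of a data file before and after encryption."""
    return {
        "unencrypted_md5": md5_of_file(plain_path),
        "encrypted_md5": md5_of_file(encrypted_path),
    }
```

Calling `checksum_pair` with a data file and its encrypted counterpart (hypothetical paths such as "sample1.bam" and "sample1.bam.gpg") yields the two values to record for that file.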
2) Array Based Submissions
“The EGA accepts processed data from all types of array based technologies, such as genotypes,
gene expression, methylations, etc.”
https://www.ebi.ac.uk/ega/submission/manual#Contact_Genotype
The key stages to array based submissions are:
a) Contact EGA helpdesk.
Provide details of the data sizes and types you wish to submit.
b) Receive a submission pack
Includes unique accession numbers, login details for account, submission guide documents
and policy statements for completion.
c) Upload Data
Upload your data using the EGA Webin tool, which automatically encrypts data and generates
md5sum values for checking.
d) Document
Provide details of the protocols, samples, experiments and policy documentation. One uses
the EGA-Array-based-Format document (EGA-AF) spreadsheet template to add meta-data
and policy documentation associated with each genotype submission.
*The EGA will only process the submission once the EGA-AF is completed and received.
The EGA-AF template comprises 4 parts within a spreadsheet:
1) Investigator and Policy documents: information about the study and policy documentation,
and contact details of the submitter
2) Sample and phenotype information
3) Datasets and description
4) Data files and how data is organised for distribution.
Appendix B walks through an example submission.
3) GWAS summary/aggregate submission
The EGA accepts submissions of complete summary level data associated with processed data such
as genotypes, structural variants and whole genome sequence with any value associated with these
calls. As an example, summary level data associated with genotypes called with separate algorithms
can be submitted.
If applicable, please ensure that your submission of summary level data does not contravene the
original consent agreements signed by the participants of the study.
WE DO NOT ACCEPT SUMMARY SUBMISSIONS BASED ON TOP LEVEL SNPS.
The steps are similar: contact the EGA, receive a submission pack with account details, upload the
data, and document it.
The steps for the documentation are provided below:
The EGA-Genotype-Submission-Format (EGA-GSF) is a spreadsheet template for submitters to add
metadata associated with their summary level submission. Once completed and validated, the
EGA-GSF is used to produce a website that will describe and link to the submitted data.
The EGA can only process a submission once a completed EGA-Genotype-Submission-Format
document is received from the submitter.
The EGA-GSF spreadsheet consists of three components:
1) Study and investigators: Information including the title, description, publication and contact
details.
2) Dataset: Define how your data is going to be organised into datasets for distribution.
3) Data Files: Filenames of data files to be submitted and the name of the dataset to which they
will be affiliated.
An example is provided in Appendix C.
What happens after data is submitted successfully
A draft website is prepared, which will point to your study, dataset and Data Access Committee.
Once your draft website is completed, a member of the EGA will be in touch before your website
goes live to ensure:
1. Your study is represented accurately
2. Access to EGA user management tools is provided to the Data Access Committee named
contacts
3. Further information regarding the role of the Data Access Committee can be found here
Finally, your data is archived within our databases and prepared for encrypted distribution upon the
request of permitted EGA account holders.
We strongly advise you NOT to delete your data until we confirm that your data has been
successfully archived.
Policy documentation required for submissions
The following policy documentation is required to be prepared and submitted to the EGA, together
with your data files and associated metadata.
Data Access Agreement (DAA) - Data access application form - Policy statements
**All policy documentation should be emailed directly to EGA Helpdesk**
Please be advised that the EGA cannot process your submission without the documentation
shown below.
Data Access Agreement (DAA)
Please find below links to examples of Data Access Agreements (DAA) used by existing Data Access
Committees (DACs).
The Data Access Agreement is a contract made between user and Data Access Committee. The
agreement should be drafted by the DAC and includes, but is not limited to, details of data use,
publication restrictions and storage.
Completion of a DAA by the applicant/s should form part of the application process to the DAC.
Wellcome Trust Case Control Consortium DAA
Wellcome Trust Sanger Institute Cancer Genome Project (UK- Academic)
Wellcome Trust Sanger Institute Cancer Genome Project (US - Corporate)
Data access application form
Please find below links to examples of Data access forms used by existing Data Access Committees
(DACs).
The Data access form should be drafted by the DAC, for the purpose of capturing the necessary
information from a user wishing to access data.
Completion of a Data access application form by the applicant/s should form part of the application
process to the DAC.
MalariaGen Data access form
Wellcome Trust Case Control Consortium Data access form
Policy statements
Please find below a policy statement example. All submitters must provide the policy statements
captured in this template. An example policy statement is shown in Appendix D.
APPENDIX A: Creating and submitting XMLs
Taken verbatim from: https://www.ebi.ac.uk/ega/submission/manual
All metadata required by the EGA may be collected using the EGA’s XMLs. Submitters are required
to prepare, validate and submit the XMLs.
Working with XML
We recommend manipulating EGA metadata using an XML editor, preferably one with the ability to
validate against XML schemas. A good article on choosing an XML editor can be found here.
Alternatively, XML can be edited in standard text editors and then checked using an XML validator,
e.g. xmllint, a free unix-based XML validator.
General concepts: Aliases and center names
Every EGA object must be uniquely identified within the submission account using the alias
attribute. The aliases can be used in submissions to make references between EGA objects. Please
find more information about the use of aliases and center names below:
alias attribute: every object should have a name that is unique within your submission account.
Once submitted successfully, every alias will be assigned an accession.
refname attribute: when an object references another by its alias, the alias goes into the refname
attribute. For example, if a sample has the alias "sample1", and an experiment uses this sample,
then the EXPERIMENT/SAMPLE/refname should be "sample1".
center_name attribute: The center_name attribute is required within the submission XML and will
be propagated to all other XMLs if not individually provided. This element is the controlled
vocabulary acronym or abbreviation that is provided to the account holder when the account is
first generated for an institute. If the submitter is brokering a submission for another institute, the
submitter should use their special broker account name in broker_name while the data centre
acronym remains in center_name.
run_center attribute: Many submitting centres contract out the sequencing to another centre. In
these cases, the sequencing centre should be acknowledged in the run_center attribute. Again, this
is controlled vocabulary and the acronym should be sought from EGA before submitting.
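The alias and refname relationship described above can be checked mechanically before submission. The sketch below is a simplified illustration: the function name is hypothetical, and the XML layout is an assumption (samples carry an alias attribute on SAMPLE elements, and an experiment points at a sample through a refname attribute on an element whose name starts with SAMPLE). The authoritative element names are those in the schema files listed later in this appendix.

```python
import xml.etree.ElementTree as ET


def unresolved_refnames(sample_xml: str, experiment_xml: str):
    """Return sample refnames used by experiments that match no sample alias."""
    aliases = {el.get("alias") for el in ET.fromstring(sample_xml).iter("SAMPLE")}
    refs = [el.get("refname")
            for el in ET.fromstring(experiment_xml).iter()
            if el.tag.startswith("SAMPLE") and el.get("refname") is not None]
    return [ref for ref in refs if ref not in aliases]


# Simplified, hypothetical documents: only the attributes discussed above are shown.
samples = '<SAMPLE_SET><SAMPLE alias="sample1"/></SAMPLE_SET>'
experiments = (
    '<EXPERIMENT_SET>'
    '<EXPERIMENT alias="exp1"><SAMPLE_DESCRIPTOR refname="sample1"/></EXPERIMENT>'
    '<EXPERIMENT alias="exp2"><SAMPLE_DESCRIPTOR refname="sample2"/></EXPERIMENT>'
    '</EXPERIMENT_SET>'
)
print(unresolved_refnames(samples, experiments))  # prints: ['sample2']
```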
Validating and submitting your EGA XML's
Please submit your EGA XML's to your XML upload account. Please note, that your log-in details to
this account should have been provided at the beginning of the submission process.
Test XML upload account (recommended for first time users):
https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/
Production XML upload account:
https://www.ebi.ac.uk/ena/submit/drop-box/submit/
Submitters are advised to use the Test XML upload account when submitting XML’s to the EGA for
the first time. The test service is identical to the production service except that all submissions will
be discarded on the following day.
We recommend that you validate all XMLs using the ‘VALIDATE’ action in your submission XML
before submitting using the ‘ADD’ action.
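The validate-then-add workflow recommended above can be supported by keeping one submission XML and switching its action between the two passes. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the layout of the submission XML (ACTION elements wrapping a VALIDATE or ADD element with source and schema attributes) is assumed rather than taken from this document, so it should be checked against the SRA.submission.xsd schema listed in this appendix.

```python
import xml.etree.ElementTree as ET


def switch_action(submission_xml: str, old: str = "VALIDATE", new: str = "ADD") -> str:
    """Return a copy of the submission XML with every `old` action renamed to `new`."""
    root = ET.fromstring(submission_xml)
    for element in list(root.iter(old)):
        element.tag = new
    return ET.tostring(root, encoding="unicode")


# Hypothetical, simplified submission XML used for the validation pass.
draft = (
    '<SUBMISSION alias="sub1" center_name="EXAMPLE"><ACTIONS>'
    '<ACTION><VALIDATE source="study.xml" schema="study"/></ACTION>'
    '</ACTIONS></SUBMISSION>'
)
final = switch_action(draft)
print("VALIDATE" in final, "<ADD " in final)  # prints: False True
```

Generating the final file from the validated draft keeps the two passes identical apart from the action name.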
Stage 1 XML Descriptions
• Submission XML – describes the submission transaction, contact details, md5 checksum
values before and after encryption:
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.submission.xsd
• Study XML – describes the study in detail: title, study name and abstract. Also provides a unique
identifier / accession that can be used within the submission receipt:
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.study.xsd
• Sample XML – description of each of the samples used in the study:
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.sample.xsd
• Experimental XML – experimental details such as library preparation, sequencing platforms
and type etc. (different XML files for different platforms / experimental types):
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.experiment.xsd
• DAC XML – description of the Data access policy and url:
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/EGA.dataset.xsd
• Policy XML – describes the data access agreement to be linked to the DAC:
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_3/EGA.policy.xsd
• Run XML – describes the data file and its relation to the experiment:
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.run.xsd
• Another interesting file is the Analysis XML, which can be used to submit BAM files to the
EGA with one BAM file for each analysis.
• A similar XML file is also used for submitting VCF files to EGA.
• Dataset XML – describes the data files that constitute the dataset, links them to the
specific Policy in place, and is defined by the Run.XML and Analysis.XML:
ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.dataset.xsd
The Submission_example.xml, Study_example.xml, Sample_example.xml, Experiment_example.xml,
Run_example.xml, DAC_example.xml and Policy_example.xml files are submitted to your Production XML
upload account; on successful completion you obtain a receipt with accession numbers for
each object.
Stage 2 XML Descriptions
The stage 2 submission XML names a single XML object, the dataset file, and the people to
contact about issues arising with the submission.
The Dataset XML groups the experiments / accession numbers from stage 1 into a single dataset with
its access policy.
A receipt will be generated upon successful completion.
Upon completion an EGA website / space is prepared and an EGA member will contact you to ensure
the details are correct.
APPENDIX B: Example of a genotype data submission process
Taken directly from EGA documentation. What follows is an EGA-AF walk-through based on a
hypothetical case-control genotype submission consisting of 2 human lung samples genotyped with
2 different platforms: Affymetrix_500K and Illumina_550K.
i) Individual contact details
ii) Details of data providers and data abstract
iii) Attaching policy documentation
Notes on policy documentation:
• The document MUST be signed by an individual capable of confirming the statements
made therein (e.g. Principal Investigator).
• Please add your policy document template to your data file upload account or email it directly
to the EGA Helpdesk.
You can view an example of policy statements here.
iv) Details of your Data Access Committee (DAC)
Notes on Data Access Committees:
• Please add your Data access application form and Data Access Agreement form to your data
file upload account or email them directly to the EGA Helpdesk.
• View examples of a ‘Data Access Application’ and a ‘Data Access Agreement’.
• Further information on DACs can be found here.
v) Further details of study and release policy
EGA-Array-based-Format document: Samples and phenotypes
What follows is a small sample of the Samples and phenotypes component, which consists of 2
samples from two individuals. Both samples have been genotyped using Affymetrix_500K and
Illumina_550K platforms and three types of genotype calling software have been used (chiamo,
brlmm and Illuminus).
You will find the Samples and phenotypes component located in the tab at the bottom of the
sheet shown here:
Important note: If you have uploaded files NOT using the EGA uploader, you must upload the
encrypted and unencrypted md5sum values of all files uploaded to your submission account. Your
submission will not be processed without md5sum values supplied for all files.
Datasets
What follows is a small sample of the dataset component. We suggest that each dataset should
consist of a common set of data. The example below consists of three datasets, grouped according
to shared data type, technology and by case/control.
We also like to capture the number of samples that make up a dataset and the Data Access
Committee responsible for approving access to the named dataset. You will find the Dataset
component located in the tab at the bottom of the sheet shown here:
Datafiles
What follows is an example of how to map your samples (detailed in the Samples and phenotype
tab) to the genotype files added to your upload account.
You will find the Genotype and SNP component located in the tab at the bottom of the sheet
shown here:
Important note: If you have uploaded files NOT using the EGA uploader, you must upload the
encrypted and unencrypted md5sum values of all files uploaded to your submission account. Your
submission will not be processed without md5sum values supplied for all files.
APPENDIX C: Example of a GWAS summary/aggregate submission
What follows is an EGA-GSF example based on a hypothetical case-control genotype summary
submission, conducted on 1500 human lung samples genotyped with 2 different platforms:
Affymetrix_500K and Illumina_550K.
Individual contact details:
Details of data provider and data abstract:
Further details of study:
Dataset:
Data files:
APPENDIX D: Example Policy Statement