App C H3Africa EGA Submission Guidelines 13092013
European Genome-phenome Archive (EGA)

General Information

1) What is the EGA?

The EGA [1] is a centralised repository that permanently stores, and provides user access to, all types of personally identifiable genetic and phenotypic biomedical research data. The EGA archives data from study participants who have expressly consented to the release of their data only for specific research use and only to bona fide researchers. Very strict policies and protocols determine how the data is managed, stored and distributed. Only members of the EGA are allowed to process data on site, and all data is encrypted with encryption keys / codes.

2) EGA System and Security

The EGA is a secure computing facility that uses a shared EBI setup for data submissions and has a petabyte-scale archive for original data files, with data access via the website. All data outside this secure EGA site is encrypted. The EBI web team has implemented strict web protocols that include phpBB for https logins, and also maintains the file archive and data submission parts. The EGA itself is not visible to any EBI network machine and is accessible only to EGA staff.

The EGA data archive is modular, to provide high security and performance for large archived datasets. Individuals, their samples and phenotypes are stored in separate databases. Experimental data is also stored in separate databases depending on the type of data. The EGA short read archive is only for raw sequence data. Processed data types must be submitted separately to the EGA and are stored in dedicated databases.

3) How is data accessed from the EGA?

The EGA implements a distributed access-granting policy in which the decision to grant access to the data resides with the relevant consortium Data Access Committee (DAC). To access data in the EGA, a scientist has to apply in writing to the DAC.
The DAC, not the EGA, is responsible for making all data access decisions; the EGA merely facilitates the secure archiving and hosting of the data. The DAC is composed of members from the organisation that produced the data, not EGA personnel. The Data Access Agreement (DAA) is a contract made directly between the relevant DAC and the applicant wishing to access the data. The EGA will only provide access to the data once the DAC has passed a successful application on to the EGA.

[1] https://www.ebi.ac.uk/ega/node/66

The EGA provides encryption keys / codes physically and offline after one is granted access to a particular dataset / study by approval from a DAC. These can be used to obtain web access.

The EGA supports consortium members who need to access data before publication by allowing access to only the consortium data and study websites by means of authorised secure logins. All consortium members have access to the consortium web page, which is used to collate data on consortium projects in the EGA. The pre-publication phase usually lasts between 6 and 12 months.

4) Consortium Specific Website

The EGA creates a separate website for each consortium that deposits data within its system. The website provides information about the consortium, a link back to that consortium’s website and a list of the studies archived by that consortium. Each study is assigned a stable identifier that can be referred to in publications.

Authorised data access requires a user to log in. Files designated for distribution are encrypted and moved to a dedicated disk area outside the EGA. Each file is made available only to those users who have been granted access to the data by the DAC. Hyperlinks to download data are available on the secure login website to approved users only, and a script verifies access to the data before downloads begin.
For large datasets web download is not optimal, so the EGA enables temporary FTP access via an Aspera account. The Aspera account requires a password and can only be used by one person at a time.

5) Data Submission

Before the EGA can accept any data submission, the following policy documentation must be submitted with the data: a data access agreement, a data access application form and a policy statement (examples are provided in the submission pack sent when contacting the EGA about submitting data). The EGA only accepts de-identified data with a DAC-approved plan.

Accepted data types include manufacturer-specific raw data formats from array based and new sequencing platforms. Processed data, which may include genotypes, structural variants or any summary-level statistical analyses from the original study authors, is stored in databases. The EGA will also accept and provide access to any phenotype data associated with the data. Prior to submission, submitters must contact the EGA.

EGA Data Types and Formats for Submission

1) What type of data does the EGA accept?

The EGA accepts [2]:

a) Sequence data (raw/unaligned and analysis/aligned), including RNASeq, resequencing, epigenomics, transcriptomics and other sequence based assays [3].
b) Array based data (genotypes, SNP, expression) and their associated phenotypic information [4].
c) Analysis based submissions.

[2] https://www.ebi.ac.uk/ega/submission
[3] https://www.ebi.ac.uk/ega/submission/sequence

2) What file formats does the EGA accept?

a) Sequence Based Submissions

i. Sequences

BAM format is the preferred EGA option. All BAM files submitted must be readable with SAMtools and Picard. Currently BAM files must be de-multiplexed prior to submission (plans to accept submission of BAM files with reads from multiple samples are in the pipeline).
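The readability requirement for BAM files can only really be confirmed with SAMtools or Picard themselves, but a quick local pre-check can catch files that are plainly not BAM. The following is a minimal sketch, not part of the EGA tooling: it relies on the fact that a BAM file is BGZF-compressed (readable as gzip) and that its decompressed content begins with the four magic bytes `BAM\1`. The function name `looks_like_bam` is hypothetical.

```python
import gzip
import zlib


def looks_like_bam(path: str) -> bool:
    """Return True if the file decompresses as gzip/BGZF and starts with the BAM magic.

    This is only a sanity check; it does not prove that SAMtools or Picard
    can read the whole file.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # The first four decompressed bytes of a BAM file are b"BAM\x01".
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError, zlib.error):
        # Not gzip-compatible, truncated, or corrupted.
        return False
```

A file that passes this check should still be opened with SAMtools and Picard before upload, since the check says nothing about the header, the alignments or whether reads from several samples are mixed.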
Table 1: Summary of information on supported file formats for EGA sequence based data

Supported:
- BAM – most preferred
- SFF for 454 data
- Convert Illumina Scarf Format to FastQ
- Convert Illumina qseq to FastQ format before submission
- FastQ format
- PacBio HDF5 format
- Complete Genomics data (with some caveats – see below)
- Reference alignments – BAM format
- Sequence Variations – VCF format

Not Supported:
- Colour spaced BAM
- Files not de-multiplexed before submission
- Signal data for Illumina GA / Hiseq and SOLiD Platforms
- Scarf Format from Illumina (not greatly supported)
- Illumina qseq format
- Pooled data must be de-multiplexed and barcoded before submission

Note: Colour spaced BAM files are not supported. Data files have to be de-multiplexed before submission so that each run is submitted with files containing data for a single sample only. Signal data is no longer accepted for Illumina GA/Hiseq and SOLiD platforms but continues to be supported for the 454 platform. The minimum submission level for the EGA is base/colour calls with quality scores.

As BAM is near optimal in terms of compression, BAM files should be submitted without further compression. For 454 data the EGA accepts SFF files, which are also already compressed and should likewise be submitted without further compression.

Illumina Scarf Format: the EGA will accept it but is not keen on it, as these submissions cannot be processed or made available in other formats. The EGA requires one to convert Illumina Scarf Format data to Fastq prior to submission, and to convert the Scarf format log-odds qualities to Phred qualities when preparing the FastQ files for submission.

[4] https://www.ebi.ac.uk/ega/submission/array_based

Illumina qseq format: the EGA will accept it but is not happy with this format, for the same reasons as the Scarf format above. Submitters should convert from qseq to fastq format.

Data submissions in PacBio HDF5 format are accepted.

“Complete Genomics data should be submitted using the intact Complete Genomics directory tree structure containing the ASM, LIB and MAP subfolders.
Each individual genome should be submitted as a single Run object associated with a single Experiment and Sample object. Please note that the reads and mappings in the MAP directory should be included in the submission.”

If submitting Fastq files then the compression algorithm to use is gzip or bzip2.

For FastQ format, primary sequence data submissions of single and paired reads are accepted as Fastq files that meet the following requirements:

- Quality scores must be in Phred scale. For example, quality scores from early Solexa pipelines must be converted to use this scale.
- Both ASCII and space-delimited decimal encoding of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
- No technical reads (adapters, linkers, barcodes) are allowed.
- Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
- Paired reads must be split and submitted using either one or two Fastq files. The read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2' (regular expression for the reads "^(.*)([\\.|:|/|_])([12])$").
- The first line for each read must start with '@'.
- The base calls and quality scores must be separated by a line starting with '+'.
- The Fastq files must be compressed using gzip or bzip2.

Example of Fastq file containing single reads:

    @read_name
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    ...

Example of Fastq file containing paired reads:

    @read_name/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    @read_name/2
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    ...

where <cycle> indicates the cycle number that starts the second read. The Fastq files should be compressed using gzip.
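The Fastq requirements above lend themselves to a local pre-submission check. The following is a minimal sketch under several assumptions: records are the simple four-line form shown in the examples, quality scores are ASCII-encoded (space-delimited decimal scores are not handled), and the offset guess is a common heuristic rather than the EGA's own detection. The function names are hypothetical; the regular expression is the one quoted above with the Java-style double backslash reduced to a single one.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Read-name pattern quoted in the guidelines for paired reads.
PAIR_SUFFIX = re.compile(r"^(.*)([\.|:|/|_])([12])$")


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_name, bases, qualities) from four-line Fastq records."""
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per read")
    for start in range(0, len(lines), 4):
        record_number = start // 4 + 1
        header, bases, plus, quals = (s.rstrip("\r\n") for s in lines[start:start + 4])
        if not header.startswith("@"):
            raise ValueError(f"record {record_number}: first line must start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"record {record_number}: separator line must start with '+'")
        if len(bases) != len(quals):
            raise ValueError(f"record {record_number}: base and quality lengths differ")
        yield header[1:], bases, quals


def pair_member(read_name: str) -> str:
    """Return '1' or '2' for a paired read name, or raise if the suffix is missing."""
    match = PAIR_SUFFIX.match(read_name)
    if match is None:
        raise ValueError(f"read name {read_name!r} has no pair suffix such as '/1' or '/2'")
    return match.group(3)


def guess_phred_offset(quality_strings: Iterable[str]) -> int:
    """Heuristic guess: any character below ASCII 64 implies offset 33, otherwise 64."""
    lowest = min(ord(ch) for quals in quality_strings for ch in quals)
    return 33 if lowest < 64 else 64
```

The offset guess should be made over all reads in a file rather than a single read, because a short run of high-quality bases encoded with offset 33 can look like offset 64 on its own.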
ii. Secondary analysis formats

The EGA supports 2 types of analyses: reference alignments in BAM format and sequence variations in VCF format.

b) Array Based Submissions

The EGA supports submission of processed data from all types of array based technologies, such as genotype, gene expression, methylation etc. The EGA also archives any associated phenotypic data. The EGA online user manual does not provide any further information on the files accepted for array based submissions.

c) Analysis Based Submissions

These include genotypes (summary or aggregate), structural variants (VCF), expression and phenotype [5]. Accepted data types include manufacturer-specific raw data formats from array based and NGS platforms, and processed data such as genotypes, structural variants and aligned reads.

[5] https://www.ebi.ac.uk/ega/submission/phenotypes

Key Stages of EGA Submission

1) Sequence Based Submissions to the EGA

There are 4 key stages for the submission of sequence based data according to the EGA manual (https://www.ebi.ac.uk/ega/submission/manual):

a) Contact EGA
Contact the EGA and provide details of your data files / types and their anticipated size.

b) Receive submission pack
The submission pack includes login details for your account, documents describing the key stages of submission, and a policy statements template for completion and return.

c) Upload data
Upload data files into your data upload account using the EGA Webin data uploader, which automatically encrypts the files and creates md5sum check values to make sure all your data is uploaded correctly.

d) Document
Provide details of the study, samples, experiments, policy and datasets. Metadata must be produced at this stage, either using the Webin EGA tool or by creating and submitting one’s own XMLs.

If preparing XMLs for submission there are 2 key stages.
The first stage for sequence data involves the preparation of submission, study, sample, experiment, DAC, policy and run XML files for annotation (7 XML files in total). The second stage requires a submission XML (different from the one used in stage 1) and a dataset XML file (2 XML files in total). More details on the XML files are provided in Appendix A. Example XML files that can be used / modified for submission are provided on the EGA user manual website: https://www.ebi.ac.uk/ega/submission/manual#Example_XML_Submission.

The XML files are uploaded to a Test XML user account, provided within the submission pack, which acts as a testing area in which one can validate the XMLs. Once validated, the finished XMLs are uploaded to a Production XML user account (also provided within the submission pack).

Note: A Java uploader needs to be installed if using the dropbox to upload files instead of Aspera or FTP.

Data files affiliated to a submission are uploaded into private submission drop boxes using the FTP or Aspera protocols, which are provided as part of the submission procedure. Submitters are encouraged to use the EGA uploader tool (https://www.ebi.ac.uk/ega/submission/tools), which encrypts your files, generates md5sums and uploads the files to your submission dropbox. Data files may also be uploaded manually using FTP or Aspera, but submitters *must* ensure that all data files are encrypted with GnuPG and that md5sum values are provided in the format required. Please note that the EGA uploader tool may also be used to encrypt your files and generate md5sum values *without* uploading your files.

All submissions, except summary/aggregate level, require policy documentation. This consists of 'Policy statements', a 'Data Access Agreement (DAA)' and a 'Data access application form'.

The EGA also requires the submission of associated metadata, which includes contact details of the submitter and sample descriptions.
For sequence submissions we use XMLs and/or Webin, and for array based submissions we use the Array based format document to collect this information.

2) Array Based Submissions

“The EGA accepts processed data from all types of array based technologies, such as genotypes, gene expression, methylations, etc.” https://www.ebi.ac.uk/ega/submission/manual#Contact_Genotype

The key stages of array based submissions are:

a) Contact EGA helpdesk
Provide details of the data sizes and types you wish to submit.

b) Receive a submission pack
Includes unique accession numbers, login details for your account, submission guide documents and policy statements for completion.

c) Upload data
Upload your data using the EGA Webin tool, which automatically encrypts data and generates md5sum values for checking.

d) Document
Provide details of the protocols, samples, experiments and policy documentation. One uses the EGA-Array-based-Format document (EGA-AF) spreadsheet template to add metadata and policy documentation associated with each genotype submission.

*The EGA will only process the submission once the EGA-AF is completed and received.

The EGA-AF template comprises 4 parts within a spreadsheet:

  1) Investigator and policy documents: information about the study and policy documentation, and contact details of the submitter
  2) Sample and phenotype information
  3) Datasets and description
  4) Data files and how data is organised for distribution

Appendix B walks through an example submission.

3) GWAS summary/aggregate submission

The EGA accepts submissions of complete summary level data associated with processed data such as genotypes, structural variants and whole genome sequence, with any value associated with these calls. As an example, summary level data associated with genotypes called with separate algorithms can be submitted. If applicable, please ensure that your submission of summary level data does not contravene the original consent agreements signed by the participants of the study.
WE DO NOT ACCEPT SUMMARY SUBMISSIONS BASED ON TOP LEVEL SNPS.

The steps are similar: contact the EGA, receive a submission pack with account details, upload data and document. The steps for the documentation are provided below.

The EGA-Genotype-Submission-Format (EGA-GSF) is a spreadsheet template for submitters to add metadata associated with a summary level submission. Once completed and validated, the EGA-GSF is used to produce a website that will describe and link to the submitted data. The EGA can only process a submission once a completed EGA-Genotype-Submission-Format document is received from the submitter.

The EGA-GSF spreadsheet consists of three components:

  1) Study and investigators: Information including the title, description, publication and contact details.
  2) Dataset: Define how your data is going to be organised into datasets for distribution.
  3) Data Files: Filenames of data files to be submitted and the name of the dataset to which they will be affiliated.

An example is provided in Appendix C.

What happens after data is submitted successfully

A draft website is prepared, which will point to your study, dataset and Data Access Committee. Once your draft website is completed, a member of the EGA will be in touch before your website goes live to ensure:

  1. Your study is represented accurately
  2. Access to EGA user management tools is provided to the Data Access Committee named contacts
  3. Further information regarding the role of the Data Access Committee can be found here

Finally, your data is archived within our databases and prepared for encrypted distribution upon the request of permitted EGA account holders. We strongly advise you NOT to delete your data until we confirm that your data has been successfully archived.

Policy documentation required for submissions

The following policy documentation is required to be prepared and submitted to the EGA, together with your data files and associated metadata:
- Data Access Agreement (DAA)
- Data access application form
- Policy statements

**All policy documentation should be emailed directly to EGA Helpdesk**

Please be advised that the EGA cannot process your submission without the documentation shown below.

Data Access Agreement (DAA)

Please find below links to examples of Data Access Agreements (DAA) used by existing Data Access Committees (DACs). The Data Access Agreement is a contract made between the user and the Data Access Committee. The agreement should be drafted by the DAC and includes, but is not limited to, details of data use, publication restrictions and storage. Completion of a DAA by the applicant/s should form part of the application process to the DAC.

- Wellcome Trust Case Control Consortium DAA
- Wellcome Trust Sanger Institute Cancer Genome Project (UK - Academic)
- Wellcome Trust Sanger Institute Cancer Genome Project (US - Corporate)

Data access application form

Please find below links to examples of Data access forms used by existing Data Access Committees (DACs). The Data access form should be drafted by the DAC, for the purpose of capturing the necessary information from a user wishing to access data. Completion of a Data access application form by the applicant/s should form part of the application process to the DAC.

- MalariaGen Data access form
- Wellcome Trust Case Control Consortium Data access form

Policy statements

Please find below a policy statement example. All submitters must provide the policy statements captured in this template. An example policy statement is shown in Appendix D.

APPENDIX A: Creating and submitting XMLs

Taken verbatim from: https://www.ebi.ac.uk/ega/submission/manual

All metadata required by the EGA may be collected using the EGA’s XMLs. Submitters are required to prepare, validate and submit the XMLs.

Working with XML

We recommend manipulating EGA metadata using an XML editor, preferably one with the ability to validate against XML schemas.
A good article on choosing an XML editor can be found here. Alternatively, XML can be edited in standard text editors and then checked using an XML validator, e.g. xmllint, a free unix-based XML validator.

General concepts: Aliases and center names

Every EGA object must be uniquely identified within the submission account using the alias attribute. The aliases can be used in submissions to make references between EGA objects. Please find more information about the use of aliases and center names below:

- alias attribute: every object should have a name that is unique within your submission account. Once submitted successfully, every alias will be assigned an accession.
- refname attribute: when an object references another by its alias, the alias goes into the refname attribute. For example, if a sample has the alias "sample1", and an experiment uses this sample, then the EXPERIMENT/SAMPLE/refname should be "sample1".
- center_name attribute: the center_name attribute is required within the submission XML and will be propagated to all other XMLs if not individually provided. This element is the controlled vocabulary acronym or abbreviation that is provided to the account holder when the account is first generated for an institute. If the submitter is brokering a submission for another institute, the submitter should use their special broker account name in broker_name while the data centre acronym remains in center_name.
- run_center attribute: many submitting centres contract out the sequencing to another centre. In these cases, the sequencing centre should be acknowledged in the run_center attribute. Again, this is controlled vocabulary and the acronym should be sought from EGA before submitting.

Validating and submitting your EGA XMLs

Please submit your EGA XMLs to your XML upload account. Please note that your log-in details for this account should have been provided at the beginning of the submission process.
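Before uploading, the alias and refname rules described under "General concepts" can be checked locally. The following is a minimal sketch, assuming that any element carrying an `alias` attribute defines an object and any element carrying a `refname` attribute points at one. The function name and the element names in the usage note are illustrative only and should be checked against the XSD schemas; the check does not replace validation by the upload service, and it cannot see objects that already exist in your account.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List


def unresolved_refnames(xml_documents: Iterable[str]) -> List[str]:
    """Return refname values that do not match any alias in the given XML texts."""
    aliases = set()
    references = set()
    for text in xml_documents:
        root = ET.fromstring(text)
        # root.iter() visits the root element and every descendant.
        for element in root.iter():
            if "alias" in element.attrib:
                aliases.add(element.attrib["alias"])
            if "refname" in element.attrib:
                references.add(element.attrib["refname"])
    return sorted(references - aliases)
```

For instance, given a sample document containing `<SAMPLE alias="sample1"/>` and an experiment document whose sample reference carries `refname="sample1"`, the function returns an empty list; a misspelt reference such as `refname="sample_1"` would be reported.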
Test XML upload account (recommended for first time users): https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/
Production XML upload account: https://www.ebi.ac.uk/ena/submit/drop-box/submit/

Submitters are advised to use the Test XML upload account when submitting XMLs to the EGA for the first time. The test service is identical to the production service except that all submissions will be discarded on the following day. We recommend that you validate all XMLs using the ‘VALIDATE’ action in your submission XML before submitting using the ‘ADD’ action.

Stage 1 XML Descriptions

- Submission XML – describes the submission transaction, contact details, md5 checksum values before and after encryption: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.submission.xsd
- Study XML – describes the study in detail: title, study name and abstract. Also provides a unique identifier / accession that can be used within the submission receipt: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.study.xsd
- Sample XML – description of each of the samples used in the study: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.sample.xsd
- Experiment XML – experimental details such as library preparation, sequencing platforms and type etc. (different XML files for different platforms / experimental types): ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.experiment.xsd
- DAC XML – description of the Data access policy and url: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/EGA.dataset.xsd
- Policy XML – describes the data access agreement to be linked to the DAC: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_3/EGA.policy.xsd
- Run XML – describes the data file and its relation to the experiment: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.run.xsd

Another file of interest is the Analysis XML, which can be used to submit BAM files to the EGA with one BAM file for each analysis. A similar XML file is also used for submitting VCF files to the EGA.
Dataset XML – describes the data files that constitute the dataset, links them to the specific Policy in place, and is defined by the Run XML and Analysis XML: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_4/SRA.dataset.xsd

The Submission_example.xml, Study_example.xml, Sample_example.xml, Experiment_example.xml, Run_example.xml, DAC_example.xml and Policy_example.xml files are submitted to your Production XML upload account; on successful completion one obtains a receipt with accession numbers for each object.

Stage 2 XML Descriptions

The second-stage submission XML consists of 1 named XML object (the dataset file, which draws on the objects used for stage 1) and the people to contact about issues arising with the submission. The Dataset XML groups each of the experiments / accession numbers into a single dataset with an access policy. A receipt will be generated upon successful completion. Upon completion an EGA website / space is prepared and an EGA member will make contact to ensure the details are correct.

APPENDIX B: Example of a genotype data submission process

Taken directly from EGA documentation. What follows is an EGA-AF walk-through based on a hypothetical case-control genotype submission consisting of 2 human lung samples genotyped with 2 different platforms: Affymetrix_500K and Illumina_550K.

i) Individual contact details

ii) Details of data providers and data abstract

iii) Attaching policy documentation

Notes on policy documentation: The document MUST be signed by an individual capable of confirming the statements made therein (e.g. the Principal Investigator). Please add your policy document template to your data file upload account or email it directly to EGA Helpdesk. You can view an example of policy statements here.

iv) Details of your Data Access Committee (DAC)

Notes on Data Access Committees: Please add your Data access application form and Data Access Agreement form to your data file upload account or email them directly to EGA Helpdesk.
View examples of a ‘Data Access Application’ and a ‘Data Access Agreement’. Further information on DACs can be found here.

v) Further details of study and release policy

EGA-Array-based-Format document: Samples and phenotypes

What follows is a small sample of the Samples and phenotypes component, which consists of 2 samples from two individuals. Both samples have been genotyped using the Affymetrix_500K and Illumina_550K platforms, and three types of genotype calling software have been used (chiamo, brlmm and Illuminus). You will find the Samples and phenotypes component located in the tab at the bottom of the sheet shown here:

Important note: If you have uploaded files NOT using the EGA uploader, you must upload the encrypted and unencrypted md5sum values of all files uploaded to your submission account. Your submission will not be processed without md5sum values supplied for all files.

Datasets

What follows is a small sample of the dataset component. We suggest that each dataset should consist of a common set of data. The example below consists of three datasets, grouped according to shared data type, technology and by case/control. We also like to capture the number of samples that make up a dataset and the Data Access Committee responsible for approving access to the named dataset. You will find the Dataset component located in the tab at the bottom of the sheet shown here:

Datafiles

What follows is an example of how to map your samples (detailed in the Samples and phenotype tab) to the genotype files added to your upload account. You will find the Genotype and SNP component located in the tab at the bottom of the sheet shown here:

Important note: If you have uploaded files NOT using the EGA uploader, you must upload the encrypted and unencrypted md5sum values of all files uploaded to your submission account. Your submission will not be processed without md5sum values supplied for all files.
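For files uploaded without the EGA uploader, the two md5sum values mentioned in the note above (one for the original file and one for its GnuPG-encrypted counterpart) can be computed with a few lines of code. The following is a minimal sketch: the function names and the returned dictionary layout are hypothetical, the encryption step itself is assumed to have been done separately with GnuPG, and the exact format in which the EGA expects the values should be taken from the submission pack.

```python
import hashlib
from pathlib import Path
from typing import Dict


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_pair(plain_path: str, encrypted_path: str) -> Dict[str, str]:
    """Collect the unencrypted and encrypted md5 values for one data file."""
    return {
        "file": Path(plain_path).name,
        "unencrypted_md5": md5_of_file(plain_path),
        "encrypted_md5": md5_of_file(encrypted_path),
    }
```

The unencrypted value must be computed before the original file is deleted or altered, since it cannot be recovered from the encrypted copy without decrypting it.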
APPENDIX C: Example of a GWAS summary/aggregate submission

What follows is an EGA-GSF example based on a hypothetical case-control genotype summary submission, conducted on 1500 human lung samples genotyped with 2 different platforms: Affymetrix_500K and Illumina_550K.

- Individual contact details
- Details of data provider and data abstract
- Further details of study
- Dataset
- Data files

APPENDIX D: Example Policy Statement