Replication Data for: Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics (doi:10.7910/DVN/JMFHTN)

Document Description

Citation

Title:

Replication Data for: Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

Identification Number:

doi:10.7910/DVN/JMFHTN

Distributor:

Harvard Dataverse

Date of Distribution:

2015-10-08

Version:

1

Bibliographic Citation:

Asgari, Ehsaneddin, 2015, "Replication Data for: Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics", https://doi.org/10.7910/DVN/JMFHTN, Harvard Dataverse, V1, UNF:6:MdFOywP8u70n6695tyjGAw== [fileUNF]

Study Description

Citation

Title:

Replication Data for: Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

Identification Number:

doi:10.7910/DVN/JMFHTN

Authoring Entity:

Asgari, Ehsaneddin (University of California, Berkeley)

Distributor:

Harvard Dataverse

Access Authority:

Asgari, Ehsaneddin

Depositor:

Asgari, Ehsaneddin

Date of Deposit:

2015-10-08

Holdings Information:

https://doi.org/10.7910/DVN/JMFHTN

Study Scope

Keywords:

Chemistry, Computer and Information Science, Medicine, Health and Life Sciences, Deep Proteomic, Deep Learning, Deep Genomics, Distributed Representation, Disordered Protein Prediction, Family Classification Benchmark, Protein Data Visualization, t-SNE, Word2Vec

Abstract:

Users should cite: Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. doi:10.1371/journal.pone.0141287 (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287). This archive also contains the family classification data used in that paper. The data can serve as a benchmark for the family classification task.

Methodology and Processing

Sources Statement

Data Access

Citation Requirement:

If you use this data and method, please cite the following paper: Asgari, Ehsaneddin and Mofrad, Mohammad R.K. "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLoS ONE (2015). In Press.

Notes:

This dataset is made available under a Creative Commons CC0 license with the following additional/modified terms and conditions:

If you use this data and method, please cite the following paper: Asgari, Ehsaneddin and Mofrad, Mohammad R.K. "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLoS ONE (2015). In Press.

Other Study Description Materials

Related Publications

Citation

Title:

Asgari E, Mofrad MRK (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. https://doi.org/10.1371/journal.pone.0141287

Identification Number:

10.1371/journal.pone.0141287

Bibliographic Citation:

Asgari E, Mofrad MRK (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. https://doi.org/10.1371/journal.pone.0141287

File Description--f2712444

File: family_classification_metadata.tab

  • Number of cases: 324018

  • No. of variables per record: 5

  • Type of File: text/tab-separated-values

Notes:

UNF:6:njDXLW5qKIEwZHQNVwue0g==

File Description--f2712443

File: family_classification_sequences.tab

  • Number of cases: 324018

  • No. of variables per record: 1

  • Type of File: text/tab-separated-values

Notes:

UNF:6:yizYYNIv4P+F07al61ev0g==
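
Example (illustrative only): the two files above have the same number of cases (324018), which suggests that row i of family_classification_sequences.tab belongs to row i of family_classification_metadata.tab. The codebook does not state this alignment, so the sketch below treats it as an assumption and raises an error if the row counts differ. The presence of a header row and the column order (taken from the variable list in this codebook) are also assumptions, and the helper names read_tab, pair_records and family_sizes are hypothetical.

```python
import csv
from collections import Counter

# Column names as listed in this codebook; their order in the file is assumed.
METADATA_COLUMNS = [
    "SwissProtAccessionID",
    "LongID",
    "ProteinName",
    "FamilyID",
    "FamilyDescription",
]


def read_tab(path, has_header=True):
    """Read a tab-separated file into a list of rows (lists of strings).

    Assumption: the first row is a header and is skipped by default.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    return rows[1:] if has_header else rows


def pair_records(metadata_rows, sequence_rows):
    """Attach each sequence to the metadata row at the same position.

    Assumption: the two files are row-aligned. A length mismatch is
    treated as an error rather than silently truncated.
    """
    if len(metadata_rows) != len(sequence_rows):
        raise ValueError("metadata and sequence files differ in row count")
    records = []
    for meta, seq in zip(metadata_rows, sequence_rows):
        record = dict(zip(METADATA_COLUMNS, meta))
        record["Sequences"] = seq[0]
        records.append(record)
    return records


def family_sizes(records):
    """Count how many sequences carry each FamilyID."""
    return Counter(record["FamilyID"] for record in records)
```

A possible use, assuming both files have been downloaded into the working directory: records = pair_records(read_tab("family_classification_metadata.tab"), read_tab("family_classification_sequences.tab")), followed by family_sizes(records).most_common(10) to list the largest families.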

Variable Description

List of Variables:

Variables

SwissProtAccessionID

f2712444 Location:

Variable Format: character

Notes: UNF:6:/YAzQQUr0TT+SrQCOLZY7g==

LongID

f2712444 Location:

Variable Format: character

Notes: UNF:6:GDMHWRUl4xOqo19A5EpPSQ==

ProteinName

f2712444 Location:

Variable Format: character

Notes: UNF:6:DnOuVFUmi/vjizsxYrRJdA==

FamilyID

f2712444 Location:

Variable Format: character

Notes: UNF:6:aPOODovMmWM64CAmrByv1w==

FamilyDescription

f2712444 Location:

Variable Format: character

Notes: UNF:6:OvrFPu8auY/NVFHl+biN4Q==

Sequences

f2712443 Location:

Variable Format: character

Notes: UNF:6:yizYYNIv4P+F07al61ev0g==

Other Study-Related Materials

Label:

family_classification_protVec.csv

Notes:

text/csv

Other Study-Related Materials

Label:

protVec_100d_3grams.csv

Text:

protein-vectors (ProtVec) : Distributed Representation for Proteins, for deep learning applications of proteomics. Each 3-gram is presented by a 100D vector.

Notes:

text/csv
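
Example (illustrative only): the description above says that each 3-gram is represented by a 100D vector. The sketch below shows one way such a lookup table could be applied to a sequence: split the sequence into 3-grams at the three possible reading offsets, look up each 3-gram, and add the vectors. Summation is an assumption chosen for illustration, since this codebook does not state how the vectors are combined. The layout of protVec_100d_3grams.csv is not documented here either, so the table is passed in as a plain dictionary, and the function names three_grams and embed_sequence are hypothetical.

```python
def three_grams(sequence, offset=0):
    """Return the non-overlapping 3-grams of `sequence` starting at `offset`.

    Trailing residues that do not fill a complete 3-gram are dropped.
    """
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


def embed_sequence(sequence, vectors, dim):
    """Sum the vectors of all 3-grams found at offsets 0, 1 and 2.

    `vectors` maps a 3-gram string to a list of `dim` floats.
    3-grams missing from the table are skipped (an assumption).
    """
    total = [0.0] * dim
    for offset in range(3):
        for gram in three_grams(sequence, offset):
            vector = vectors.get(gram)
            if vector is None:
                continue
            for k, value in enumerate(vector):
                total[k] += value
    return total


# Toy 2-dimensional table standing in for the 100D vectors in the file.
toy_vectors = {
    "MKV": [1.0, 0.0],
    "KVL": [0.0, 2.0],
    "LAA": [0.5, 0.5],
}
embedding = embed_sequence("MKVLAA", toy_vectors, dim=2)
```

With the real file, dim would be 100 and the dictionary would be filled from the CSV once its column layout has been checked.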