Extend your Pfam predictions without loss of precision using domain context!
All files are compressed with gzip, and can be decompressed in Windows with WinRAR. Both file formats described below (domains and Gene Ontology) are plain text with Unix newlines (which may display incorrectly in Windows and Mac systems).
Organism | Standard Pfam: Domains | Standard Pfam: Gene Ontology | dPUC Pfam (E ≤ 1): Domains | dPUC Pfam (E ≤ 1): Gene Ontology |
E. coli | Ec.stdPfam | Ec.stdPfam.GO | Ec.dpucPfam | Ec.dpucPfam.GO |
M. tuberculosis | Mt.stdPfam | Mt.stdPfam.GO | Mt.dpucPfam | Mt.dpucPfam.GO |
P. falciparum | Pf.stdPfam | Pf.stdPfam.GO | Pf.dpucPfam | Pf.dpucPfam.GO |
P. vivax | Pv.stdPfam | Pv.stdPfam.GO | Pv.dpucPfam | Pv.dpucPfam.GO |
P. knowlesi | Pk.stdPfam | Pk.stdPfam.GO | Pk.dpucPfam | Pk.dpucPfam.GO |
P. chabaudi | Pc.stdPfam | Pc.stdPfam.GO | Pc.dpucPfam | Pc.dpucPfam.GO |
P. berghei | Pb.stdPfam | Pb.stdPfam.GO | Pb.dpucPfam | Pb.dpucPfam.GO |
P. yoelii | Py.stdPfam | Py.stdPfam.GO | Py.dpucPfam | Py.dpucPfam.GO |
S. cerevisiae | Sc.stdPfam | Sc.stdPfam.GO | Sc.dpucPfam | Sc.dpucPfam.GO |
C. elegans | Ce.stdPfam | Ce.stdPfam.GO | Ce.dpucPfam | Ce.dpucPfam.GO |
D. melanogaster | Dm.stdPfam | Dm.stdPfam.GO | Dm.dpucPfam | Dm.dpucPfam.GO |
H. sapiens | Hs.stdPfam | Hs.stdPfam.GO | Hs.dpucPfam | Hs.dpucPfam.GO |
I hope you know how to parse data with Perl, because these formats are not standard.
These files are almost tab-delimited tables. The first line is the header, which names each column with a single word. Every other row describes one domain, and the domains of each protein are sorted by start site for convenience. However, each protein ID is introduced on a line of its own, preceded by the ">" symbol, and its list of domains ends when the next ID appears or the file ends (excluding the one-line header, this is analogous to the FASTA sequence format). In the example below, the columns were manually aligned with spaces to a fixed width so the table is easier to read, but the raw files are tab-separated, not fixed-width.
start  end   acc      name           GA  start2  end2  score  E         scoreSeq  ESeq     mode
>MAL13P1.1
94     442   PF05424  Duffy_binding  1   0       361   481.6  1.1e-141  675.4     5e-200   ls
608    755   PF03011  PFEMP          1   0       170   117.8  3.5e-32   305.7     9.6e-89  ls
867    1281  PF05424  Duffy_binding  1   0       361   193.8  4.8e-55   675.4     5e-200   ls
1440   1580  PF03011  PFEMP          1   0       170   187.9  2.9e-53   305.7     9.6e-89  ls
>MAL13P1.100
42     151   PF00085  Thioredoxin    1   0       109   -16.2  0.0044    -16.2     0.0044   ls
>MAL13P1.105
156    193   PF02985  HEAT           1   0       36    19.7   0.012     37.3      7.5e-15  ls
347    384   PF02985  HEAT           1   0       36    17.6   0.051     37.3      7.5e-15  ls
>MAL13P1.111
103    180   PF02617  ClpS           1   0       83    104.9  2.8e-28   104.9     2.8e-28  ls
>MAL13P1.115
4      72    PF08927  DUF1909        1   0       74    130.8  3.5e-44   130.8     3.5e-44  fs
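If Perl is not your thing, the same format can be read in a few lines of Python. The following is only a minimal sketch under the assumptions stated in the text: the first line is the header, fields are separated by tabs, and each protein ID sits on its own line starting with ">". The function names `parse_domain_table` and `read_domain_file` are made up for this illustration, and all values are kept as strings so that nothing is assumed about column types.

```python
import gzip

def parse_domain_table(lines):
    """Return {protein_id: [row, ...]}; each row maps column name -> string value."""
    rows = (line.rstrip("\r\n") for line in lines)
    header = next(rows).split("\t")       # first line names the columns
    proteins = {}
    current = None
    for row in rows:
        if not row:
            continue                       # tolerate blank lines (an assumption)
        if row.startswith(">"):
            current = row[1:]              # new protein ID, as in FASTA
            proteins[current] = []
        elif current is None:
            raise ValueError("domain row found before any '>' protein ID line")
        else:
            proteins[current].append(dict(zip(header, row.split("\t"))))
    return proteins

def read_domain_file(path):
    """Parse a gzip-compressed predictions file; the path is supplied by the caller."""
    with gzip.open(path, "rt") as handle:
        return parse_domain_table(handle)

# Small in-memory sample built from the example above, with real tab separators.
sample = "\n".join([
    "\t".join("start end acc name GA start2 end2 score E scoreSeq ESeq mode".split()),
    ">MAL13P1.105",
    "\t".join("156 193 PF02985 HEAT 1 0 36 19.7 0.012 37.3 7.5e-15 ls".split()),
    "\t".join("347 384 PF02985 HEAT 1 0 36 17.6 0.051 37.3 7.5e-15 ls".split()),
    ">MAL13P1.111",
    "\t".join("103 180 PF02617 ClpS 1 0 83 104.9 2.8e-28 104.9 2.8e-28 ls".split()),
])
domains = parse_domain_table(sample.splitlines())
for protein_id, hits in domains.items():
    print(protein_id, [(h["name"], int(h["start"]), int(h["end"])) for h in hits])
```

Because rows are keyed by the header words, the same sketch should work for both the Standard Pfam and the dPUC Pfam files as long as each file carries its own header line. Note that proteins without any predicted domain may simply be absent from the file, so look up IDs defensively.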
For the Standard Pfam predictions, the columns have the following meanings.
For the dPUC Pfam predictions, all columns have the same meanings except for the following.
These files could have been in YAML format, like the Gene Ontology data below, but this custom table format is much more compact (and parsing it is not so hard).
These are pretty simple YAML files. Each protein ID is mapped to a list of predicted Gene Ontology (GO) terms. Proteins without GO term predictions are not listed. In the example below, protein MAL13P1.1 has three predicted GO terms (GO:0004872, GO:0009405, GO:0016021).
---
MAL13P1.1:
- GO:0004872
- GO:0009405
- GO:0016021
MAL13P1.100:
- GO:0003824
- GO:0045454
MAL13P1.105:
- GO:0005488
MAL13P1.111:
- GO:0030163
MAL13P1.115:
- GO:0005622
- GO:0008270
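Any YAML library should read these files directly (in Python, for instance, PyYAML's `yaml.safe_load`). If you would rather avoid a dependency, the layout shown above is regular enough for a hand-written reader. The sketch below is not a general YAML parser: it assumes only the constructs seen in the example (an optional "---" line, unquoted protein IDs ending in ":", and "- " list items), and the name `parse_go_yaml` is invented for this illustration.

```python
def parse_go_yaml(lines):
    """Return {protein_id: [GO term, ...]} for the simple key/list layout shown above."""
    terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "---" or line.startswith("#"):
            continue                        # skip blanks, the document marker, comments
        if line.startswith("- "):
            if current is None:
                raise ValueError("list item found before any protein ID")
            terms[current].append(line[2:].strip())
        elif line.endswith(":"):
            current = line[:-1]             # protein ID without the trailing colon
            terms[current] = []
        else:
            raise ValueError("unexpected line: " + line)
    return terms

# In-memory sample copied from the example above.
go_sample = """---
MAL13P1.1:
- GO:0004872
- GO:0009405
- GO:0016021
MAL13P1.100:
- GO:0003824
- GO:0045454
MAL13P1.105:
- GO:0005488
"""
go_terms = parse_go_yaml(go_sample.splitlines())
for protein_id, go_list in go_terms.items():
    print(protein_id, len(go_list), "GO terms")
```

Since proteins without GO term predictions are not listed, use something like `go_terms.get(protein_id, [])` when combining these terms with the domain tables.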
The protein sequences used to produce these domain predictions came from PlasmoDB 6.0 in the case of the Plasmodium species, and from UniProt otherwise.
The "Standard Pfam" domain predictions were obtained by running HMMER 2.3.2 to compare our protein sequences against the Pfam 23 database of domains. The "dPUC Pfam" predictions were post-processed from the original HMMER predictions using dPUC 1.0.
The GO term predictions that follow from our domain predictions were found using the MultiPfam2GO procedure, which can be downloaded here (source code), and is cited below. To train the system, we used the Pfam 23 domain assignments on UniProt, and the GO terms associated with those UniProt sequences as downloaded on 2009-11-20.