Plain text predictions

dPUC: Domain Prediction Using Context

by Alejandro Ochoa García, Manuel Llinás, Mona Singh

Extend your Pfam predictions without loss of precision using domain context!



Download our plain text predictions

All files are compressed with gzip, and can be uncompressed on Windows with WinRAR. Both formats (domains and Gene Ontology) are plain text files with Unix newlines, which may display incorrectly on Windows and Mac systems.
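The files can also be read without uncompressing them first. The following is a minimal Python sketch, not part of dPUC itself; the function name `read_lines` and the file name in the usage note are hypothetical, and UTF-8 decoding is an assumption (the files appear to be plain ASCII).

```python
import gzip


def read_lines(path):
    """Yield the lines of a gzip-compressed text file, without trailing newlines."""
    # Text mode ("rt") decodes bytes to str. With Python's default newline
    # handling, Unix "\n" line endings are recognized on every platform.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, `for line in read_lines("Ec.stdPfam.gz"): ...` would iterate over a downloaded file, assuming it was saved under that name.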

                 Standard Pfam                dPUC Pfam, E ≤ 1
Species          Domains      Gene Ontology   Domains       Gene Ontology
E. coli          Ec.stdPfam   Ec.stdPfam.GO   Ec.dpucPfam   Ec.dpucPfam.GO
M. tuberculosis  Mt.stdPfam   Mt.stdPfam.GO   Mt.dpucPfam   Mt.dpucPfam.GO
P. falciparum    Pf.stdPfam   Pf.stdPfam.GO   Pf.dpucPfam   Pf.dpucPfam.GO
P. vivax         Pv.stdPfam   Pv.stdPfam.GO   Pv.dpucPfam   Pv.dpucPfam.GO
P. knowlesi      Pk.stdPfam   Pk.stdPfam.GO   Pk.dpucPfam   Pk.dpucPfam.GO
P. chabaudi      Pc.stdPfam   Pc.stdPfam.GO   Pc.dpucPfam   Pc.dpucPfam.GO
P. berghei       Pb.stdPfam   Pb.stdPfam.GO   Pb.dpucPfam   Pb.dpucPfam.GO
P. yoelii        Py.stdPfam   Py.stdPfam.GO   Py.dpucPfam   Py.dpucPfam.GO
S. cerevisiae    Sc.stdPfam   Sc.stdPfam.GO   Sc.dpucPfam   Sc.dpucPfam.GO
C. elegans       Ce.stdPfam   Ce.stdPfam.GO   Ce.dpucPfam   Ce.dpucPfam.GO
D. melanogaster  Dm.stdPfam   Dm.stdPfam.GO   Dm.dpucPfam   Dm.dpucPfam.GO
H. sapiens       Hs.stdPfam   Hs.stdPfam.GO   Hs.dpucPfam   Hs.dpucPfam.GO

File formats

I hope you know how to parse data with Perl, because these formats are not standard.


The domain files are almost tab-delimited tables. The first line is a header that names each column with a single word. Each following row describes one domain, and the domains of each protein are sorted by start site for convenience. However, each protein ID appears on its own line, preceded by the ">" symbol, and its list of domains ends when the next ID appears or the file ends (apart from the one-line header, this is analogous to the FASTA sequence format). In the example below, columns were manually aligned with spaces to make the table easy to read, but the raw files are tab-separated, not fixed-width.

start end  acc     name          GA start2 end2 score E        scoreSeq ESeq    mode
94    442  PF05424 Duffy_binding 1  0      361  481.6 1.1e-141 675.4    5e-200  ls
608   755  PF03011 PFEMP         1  0      170  117.8 3.5e-32  305.7    9.6e-89 ls
867   1281 PF05424 Duffy_binding 1  0      361  193.8 4.8e-55  675.4    5e-200  ls
1440  1580 PF03011 PFEMP         1  0      170  187.9 2.9e-53  305.7    9.6e-89 ls
42    151  PF00085 Thioredoxin   1  0      109  -16.2 0.0044   -16.2    0.0044  ls
156   193  PF02985 HEAT          1  0      36   19.7  0.012    37.3     7.5e-15 ls
347   384  PF02985 HEAT          1  0      36   17.6  0.051    37.3     7.5e-15 ls
103   180  PF02617 ClpS          1  0      83   104.9 2.8e-28  104.9    2.8e-28 ls
4     72   PF08927 DUF1909       1  0      74   130.8 3.5e-44  130.8    3.5e-44 fs
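Although the page suggests Perl, the format described above is easy to read in any language. Here is a minimal Python sketch, assuming the layout is exactly as described: one tab-separated header line, then ">" lines that introduce a protein ID, each followed by that protein's tab-separated domain rows. The function name `parse_domains` is hypothetical, and all values are kept as strings so that no column is misinterpreted.

```python
def parse_domains(lines):
    """Parse a domain file into (header, {protein_id: [row dicts]}).

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    lines = iter(lines)
    # The first line is the header: one word per column, tab-separated.
    header = next(lines, "").rstrip("\n").split("\t")
    proteins = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            # A ">" line starts a new protein, as in FASTA.
            current = line[1:].strip()
            proteins.setdefault(current, [])
        else:
            if current is None:
                raise ValueError("domain row found before any '>' protein line")
            # Map each field to its column name; values stay as strings.
            proteins[current].append(dict(zip(header, line.split("\t"))))
    return header, proteins
```

After parsing, a field can be converted as needed, for example `float(row["E"])` for the domain E-value column named in the header.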

For the Standard Pfam predictions, the columns have the following meanings.

For the dPUC Pfam predictions, all columns have the same meanings except for the following.

These files could have been in YAML format, like the GO term data below, but this custom table format is much more compact (and parsing it is not so hard).

GO terms

These are pretty simple YAML files. Each protein ID is mapped to a list of predicted Gene Ontology (GO) terms. Proteins without GO term predictions are not listed. In the example below, protein MAL13P1.1 has three predicted GO terms (GO:0004872, GO:0009405, GO:0016021).

 - GO:0004872
 - GO:0009405
 - GO:0016021
 - GO:0003824
 - GO:0045454
 - GO:0005488
 - GO:0030163
 - GO:0005622
 - GO:0008270
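Any YAML library should be able to load these files (in Python, for instance, PyYAML's `yaml.safe_load`). For readers who prefer to avoid a dependency, here is a minimal standard-library sketch. It assumes the simplest layout consistent with the description above: an unindented `ID:` line for each protein, followed by one `- GO:...` item per line. The function name `parse_go_terms` is hypothetical, and the sketch is not a general YAML parser.

```python
def parse_go_terms(lines):
    """Parse a simple 'protein ID -> list of GO terms' YAML file into a dict."""
    go_terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, YAML comments, and the document-start marker.
        if not line or line == "---" or line.startswith("#"):
            continue
        if line.startswith("- "):
            # A list item: one GO term for the current protein.
            if current is None:
                raise ValueError("GO term listed before any protein ID")
            go_terms[current].append(line[2:].strip())
        elif line.endswith(":"):
            # A mapping key: the protein ID (possibly quoted).
            current = line[:-1].strip().strip("'\"")
            go_terms.setdefault(current, [])
        else:
            raise ValueError(f"unrecognized line: {raw!r}")
    return go_terms
```

The result maps each protein ID to its list of GO terms, so `go_terms.get(protein_id, [])` also covers proteins that are not listed because they have no predictions.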

Data sources

The protein sequences used to produce these domain predictions came from PlasmoDB 6.0 for the Plasmodium species, and from UniProt otherwise.

The "Standard Pfam" domain predictions were produced with the HMMER 2.3.2 software, by comparing our protein sequences against the Pfam 23 database of domains. The "dPUC Pfam" predictions were post-processed from the original HMMER predictions using dPUC 1.0.

The GO term predictions that follow from our domain predictions were found using the MultiPfam2GO procedure, whose source code is available for download and which is cited below. To train the system, we used the Pfam 23 domain assignments on UniProt, and the GO terms associated with those UniProt sequences as downloaded on 2009-11-20.

Forslund K & Sonnhammer ELL. Predicting protein function from domain content. Bioinformatics 24, 1681-1687 (2008).