Plain text predictions

Download our plain text predictions

All files are compressed with gzip; on Windows they can be uncompressed with WinRAR. Both file formats (Domains and Gene Ontology, described below) are plain text with Unix newlines, which may display incorrectly on Windows and Mac systems.

	Standard Pfam		dPUC Pfam, E ≤ 1
	Domains	Gene Ontology	Domains	Gene Ontology
E. coli	Ec.stdPfam	Ec.stdPfam.GO	Ec.dpucPfam	Ec.dpucPfam.GO
M. tuberculosis	Mt.stdPfam	Mt.stdPfam.GO	Mt.dpucPfam	Mt.dpucPfam.GO
P. falciparum	Pf.stdPfam	Pf.stdPfam.GO	Pf.dpucPfam	Pf.dpucPfam.GO
P. vivax	Pv.stdPfam	Pv.stdPfam.GO	Pv.dpucPfam	Pv.dpucPfam.GO
P. knowlesi	Pk.stdPfam	Pk.stdPfam.GO	Pk.dpucPfam	Pk.dpucPfam.GO
P. chabaudi	Pc.stdPfam	Pc.stdPfam.GO	Pc.dpucPfam	Pc.dpucPfam.GO
P. berghei	Pb.stdPfam	Pb.stdPfam.GO	Pb.dpucPfam	Pb.dpucPfam.GO
P. yoelii	Py.stdPfam	Py.stdPfam.GO	Py.dpucPfam	Py.dpucPfam.GO
S. cerevisiae	Sc.stdPfam	Sc.stdPfam.GO	Sc.dpucPfam	Sc.dpucPfam.GO
C. elegans	Ce.stdPfam	Ce.stdPfam.GO	Ce.dpucPfam	Ce.dpucPfam.GO
D. melanogaster	Dm.stdPfam	Dm.stdPfam.GO	Dm.dpucPfam	Dm.dpucPfam.GO
H. sapiens	Hs.stdPfam	Hs.stdPfam.GO	Hs.dpucPfam	Hs.dpucPfam.GO
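
Because the downloads are gzip-compressed text with Unix newlines, they can also be read directly without unpacking them first. The following is a minimal Python sketch, not part of the original distribution. The function name read_lines and the UTF-8 encoding are assumptions made for illustration (the files are described above only as plain text), and the path you pass in is whatever name the downloaded file has on your disk.

```python
import gzip


def read_lines(path):
    """Yield the lines of a gzip-compressed text file, without their newlines.

    Assumption: the file decodes as UTF-8 (plain ASCII is a subset of it).
    """
    # "rt" opens the archive in text mode, so bytes are decoded to str and
    # newline translation is handled by Python rather than by the platform.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Opening the file in text mode leaves newline handling to Python, so the same code should behave the same way on Windows, Mac, and Unix systems.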

File formats

I hope you know how to parse data with Perl, because these formats are not standard.

Domains

These files are almost tab-delimited tables. The first line is the header, naming each column with a single word. Each row lists one domain of a protein, and the domains of each protein are sorted by start site for convenience. However, each protein ID is introduced on its own line, preceded by the ">" symbol, and its list of domains ends when the next ID appears or the file ends (apart from the one-line header, this is analogous to the FASTA sequence format). In the example below, columns were manually padded with spaces to a fixed width so that the table is easy to read, but the raw files are tab-separated, not fixed-width.

start end  acc     name          GA start2 end2 score E        scoreSeq ESeq    mode
>MAL13P1.1
94    442  PF05424 Duffy_binding 1  0      361  481.6 1.1e-141 675.4    5e-200  ls
608   755  PF03011 PFEMP         1  0      170  117.8 3.5e-32  305.7    9.6e-89 ls
867   1281 PF05424 Duffy_binding 1  0      361  193.8 4.8e-55  675.4    5e-200  ls
1440  1580 PF03011 PFEMP         1  0      170  187.9 2.9e-53  305.7    9.6e-89 ls
>MAL13P1.100
42    151  PF00085 Thioredoxin   1  0      109  -16.2 0.0044   -16.2    0.0044  ls
>MAL13P1.105
156   193  PF02985 HEAT          1  0      36   19.7  0.012    37.3     7.5e-15 ls
347   384  PF02985 HEAT          1  0      36   17.6  0.051    37.3     7.5e-15 ls
>MAL13P1.111
103   180  PF02617 ClpS          1  0      83   104.9 2.8e-28  104.9    2.8e-28 ls
>MAL13P1.115
4     72   PF08927 DUF1909       1  0      74   130.8 3.5e-44  130.8    3.5e-44 fs
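
The layout shown above can be read with a short loop. The following Python sketch is one possible approach under the rules stated in this section (one header line, ">" lines introducing protein IDs, tab-separated domain rows). The names parse_domains and SAMPLE are made up for this illustration, and the sample reuses two proteins from the example with real tabs in place of the padding spaces.

```python
def parse_domains(lines):
    """Parse a Domains file given as an iterable of lines.

    Returns (header, proteins), where header is the list of column names and
    proteins maps each protein ID to a list of dicts (column name -> string).
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")  # first line names the columns
    proteins = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            current = line[1:]  # a new protein ID, FASTA style
            proteins[current] = []
        else:
            if current is None:
                raise ValueError("domain row found before any protein ID")
            # Values are kept as strings; convert with int() or float() as needed.
            proteins[current].append(dict(zip(header, line.split("\t"))))
    return header, proteins


SAMPLE = "\n".join([
    "start\tend\tacc\tname\tGA\tstart2\tend2\tscore\tE\tscoreSeq\tESeq\tmode",
    ">MAL13P1.100",
    "42\t151\tPF00085\tThioredoxin\t1\t0\t109\t-16.2\t0.0044\t-16.2\t0.0044\tls",
    ">MAL13P1.105",
    "156\t193\tPF02985\tHEAT\t1\t0\t36\t19.7\t0.012\t37.3\t7.5e-15\tls",
    "347\t384\tPF02985\tHEAT\t1\t0\t36\t17.6\t0.051\t37.3\t7.5e-15\tls",
])

header, proteins = parse_domains(SAMPLE.splitlines())
for protein_id, domains in proteins.items():
    print(protein_id, [(d["name"], d["start"], d["end"]) for d in domains])
```

Because rows are keyed by the header names rather than by position, the same function should also work for files whose header lists additional columns.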

For the Standard Pfam predictions, the columns have the following meanings.

start, end: the range of the predicted domain on the protein sequence. NOTE: the positions are numbered starting from zero (zero-based), and the end position marks the first position outside the domain (end-exclusive). This rule follows the indexing conventions of a multitude of programming languages (including C and Perl), and this way the length of the domain is end-start. To convert to traditional ranges, where we start counting from 1 and the end is the last position inside the domain, simply add one to the start (the end stays the same).
start2, end2: the range of the prediction in the HMM (profile hidden Markov model) of the domain (this range is non-trivial since domains may be fragmented). NOTE: this range is also zero-based, end-exclusive (see above).
acc, name: the Pfam accession and name of each domain.
GA: a boolean (1 = true, 0 = false) that indicates whether this domain passed the Standard Pfam thresholds (non-trivial for our dPUC predictions). GA stands for "gathering" threshold, which is Pfam jargon.
score, E: the original HMM score and E-value of this domain.
scoreSeq, ESeq: the sum of the original HMM scores of all domains of the same family in this sequence, and the E-value of this sum of scores. Seq stands for "sequence" score, which is Pfam jargon. The reason for summing these scores is that Pfam sets a second threshold on this sum (the first threshold is on each domain).
mode: the alignment mode of this domain. ls = "glocal" mode (complete domains are aligned), fs = "local" mode (fragments of domains are aligned). Both are, again, Pfam jargon.
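
Two of the column definitions above lend themselves to a quick worked check: the coordinate convention of start and end, and scoreSeq as a per-family sum. The Python sketch below is only an illustration. The function names are made up here, and the numbers are copied from protein MAL13P1.1 in the example table.

```python
from collections import defaultdict


def to_one_based(start, end):
    """Convert a zero-based, end-exclusive range to one-based, end-inclusive."""
    return start + 1, end


def domain_length(start, end):
    """Length of a zero-based, end-exclusive range."""
    return end - start


def family_score_sums(domains):
    """Sum per-domain scores by Pfam accession for one protein.

    `domains` is an iterable of (acc, score) pairs.
    """
    sums = defaultdict(float)
    for acc, score in domains:
        sums[acc] += score
    return dict(sums)


# Domains of MAL13P1.1 from the example table, as (acc, score).
mal13p1_1 = [
    ("PF05424", 481.6),
    ("PF03011", 117.8),
    ("PF05424", 193.8),
    ("PF03011", 187.9),
]

print(to_one_based(94, 442))   # first Duffy_binding domain
print(domain_length(94, 442))
for acc, total in family_score_sums(mal13p1_1).items():
    print(acc, round(total, 1))
```

Rounded to one decimal, the two sums come out to 675.4 for PF05424 and 305.7 for PF03011, which are the scoreSeq values listed for those families in the example. In Python the zero-based, end-exclusive ranges can also be used directly as slice bounds, so sequence[start:end] extracts a domain without any conversion.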

For the dPUC Pfam predictions, all columns have the same meanings except for the following.

score: this is the final score of this domain (original HMM score plus context scores). Note that although the score changes, the reported E-value of this domain stays the same as without context (our method does not re-estimate E-values).
scoreHmm: this is the original HMM score of this domain.
scoreContext: this is the sum of the context scores of this domain.
scoreSeq, scoreHmmSeq, scoreContextSeq: these are the sums of the respective scores over all domains of the same family in this sequence, in analogy to the Standard Pfam definition above.
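
The relation between the three dPUC scores (final score = original HMM score + context scores) can be checked row by row once a file is parsed into dictionaries keyed by column name. The sketch below is an assumption-laden illustration: the function name and the tolerance are choices made here, and the two rows are invented values, not taken from any real file. The tolerance allows for scores being printed with one decimal.

```python
def context_consistent(row, tol=0.05):
    """Return True if score is close to scoreHmm + scoreContext for one row.

    `row` maps column names to strings, as read from a dPUC Domains file.
    `tol` is an assumed allowance for rounding in the printed values.
    """
    total = float(row["scoreHmm"]) + float(row["scoreContext"])
    return abs(float(row["score"]) - total) <= tol


# Hypothetical rows, invented purely to exercise the function.
consistent_row = {"score": "21.3", "scoreHmm": "19.7", "scoreContext": "1.6"}
inconsistent_row = {"score": "25.0", "scoreHmm": "19.7", "scoreContext": "1.6"}

print(context_consistent(consistent_row))
print(context_consistent(inconsistent_row))
```

Accessing the columns by name avoids relying on the column order of the dPUC files, which is given only by each file's own header line.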

These files could have been in YAML format, like the GO term files described below, but this custom table format is much more compact (and parsing it is not so hard).

GO terms

These are pretty simple YAML files. Each protein ID is mapped to a list of predicted Gene Ontology (GO) terms. Proteins without GO term predictions are not listed. In the example below, protein MAL13P1.1 has three predicted GO terms (GO:0004872, GO:0009405, GO:0016021).

---
MAL13P1.1:
 - GO:0004872
 - GO:0009405
 - GO:0016021
MAL13P1.100:
 - GO:0003824
 - GO:0045454
MAL13P1.105:
 - GO:0005488
MAL13P1.111:
 - GO:0030163
MAL13P1.115:
 - GO:0005622
 - GO:0008270
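
Any general YAML parser should be able to load these files. For readers who prefer to avoid a dependency, the Python sketch below reads only the restricted layout shown above (a "---" line, unquoted protein IDs ending in ":", and list items starting with "- "). It is not a YAML parser, the names parse_go_terms and SAMPLE_GO are made up for this illustration, and it would need changes if the real files use other YAML features such as quoted keys.

```python
def parse_go_terms(lines):
    """Map each protein ID to its list of GO terms.

    Handles only the simple layout of the example: a key line ending in ":"
    followed by "- TERM" items. Anything else raises ValueError.
    """
    mapping = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "---":
            continue  # skip blank lines and the YAML document marker
        if line.startswith("- "):
            # A list item; checked first because GO terms contain a colon too.
            if current is None:
                raise ValueError("GO term found before any protein ID")
            mapping[current].append(line[2:].strip())
        elif line.endswith(":"):
            current = line[:-1]  # a new protein ID
            mapping[current] = []
        else:
            raise ValueError(f"unexpected line: {raw!r}")
    return mapping


SAMPLE_GO = """---
MAL13P1.1:
 - GO:0004872
 - GO:0009405
 - GO:0016021
MAL13P1.105:
 - GO:0005488
"""

go_terms = parse_go_terms(SAMPLE_GO.splitlines())
print(go_terms["MAL13P1.1"])
```

Since proteins without GO term predictions are not listed, looking up an ID with go_terms.get(protein_id, []) is a convenient way to treat missing proteins as having no terms.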

Data sources

The protein sequences used to produce these domain predictions came from PlasmoDB 6.0 in the case of the Plasmodium species, and from UniProt otherwise.

The "Standard Pfam" domain predictions were produced with the HMMER 2.3.2 software, comparing our protein sequences against the Pfam 23 database of domains. The "dPUC Pfam" predictions were obtained by post-processing the original HMMER predictions with dPUC 1.0.

The GO term predictions that follow from our domain predictions were found using the MultiPfam2GO procedure, which can be downloaded here (source code), and is cited below. To train the system, we used the Pfam 23 domain assignments on UniProt, and the GO terms associated with those UniProt sequences as downloaded on 2009-11-20.

Forslund K & Sonnhammer ELL. Predicting protein function from domain content. Bioinformatics 24, 1681-1687 (2008).