Domain predictions for the Apicomplexa

using DomStratStats

by Alejandro Ochoa García



The goal was to generate domain predictions for Plasmodium falciparum in particular. Predictions were generated for the rest of the Apicomplexa only to improve the q-value statistics, which are estimated across the whole set: a single proteome has too little data to give good q-value estimates for most domain families (which are analyzed independently). Because the other proteomes are only auxiliary, the P. falciparum proteome used is much newer than the rest. However, I include all the data for completeness, in the hope that the domain predictions for the other organisms are useful even though some of their sequences are outdated.

Download our raw domain data

All files are compressed with gzip and can be uncompressed in Windows with WinRAR. They are plain text files with Unix newlines (which may display incorrectly on Windows and Mac systems).
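For scripted access, the files can also be read directly without uncompressing them first. The following is a minimal Python sketch that uses only the standard library. The helper name `read_gz_lines` is hypothetical, and the UTF-8 encoding is an assumption (the files are expected to be plain ASCII, which UTF-8 covers).

```python
import gzip


def read_gz_lines(path):
    """Yield the lines of a gzip-compressed text file, without trailing newlines.

    Text mode with Python's default newline handling reads Unix newlines
    correctly on any platform. The encoding is an assumption.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, `sum(1 for line in read_gz_lines("Pf.fa.gz") if line.startswith(">"))` would count the sequences in the P. falciparum FASTA file, assuming it has been downloaded to the current directory.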

Sequences Domains
Source Data Standard Pfam Domain Stratified Statistics Tiered Stratified q-values
P. falciparum PlasmoDB 9.0 Pf.fa.gz Pf.dss.txt.gz Pf.tsq.txt.gz
P. vivax PlasmoDB 6.4 Pv.fa.gz Pv.dss.txt.gz Pv.tsq.txt.gz
P. knowlesi PlasmoDB 6.4 Pk.fa.gz Pk.dss.txt.gz Pk.tsq.txt.gz
P. yoelii PlasmoDB 6.4 Py.fa.gz Py.dss.txt.gz Py.tsq.txt.gz
P. chabaudi GeneDB 2010-07 Pc.fa.gz Pc.dss.txt.gz Pc.tsq.txt.gz
P. berghei GeneDB 2010-07 Pb.fa.gz Pb.dss.txt.gz Pb.tsq.txt.gz
B. bovis UniProt 2010-07-21 Bb.fa.gz Bb.dss.txt.gz Bb.tsq.txt.gz
T. annulata UniProt 2010-07-21 Ta.fa.gz Ta.dss.txt.gz Ta.tsq.txt.gz
T. parva UniProt 2010-07-21 Tp.fa.gz Tp.dss.txt.gz Tp.tsq.txt.gz
T. gondii ToxoDB 7.0 Tg.fa.gz Tg.dss.txt.gz Tg.tsq.txt.gz
N. caninum ToxoDB 7.0 Nc.fa.gz Nc.dss.txt.gz Nc.tsq.txt.gz
E. tenella ToxoDB 7.0 Et.fa.gz Et.dss.txt.gz Et.tsq.txt.gz
C. hominis CryptoDB 4.0 Ch.fa.gz Ch.dss.txt.gz Ch.tsq.txt.gz
C. muris CryptoDB 4.0 Cm.fa.gz Cm.dss.txt.gz Cm.tsq.txt.gz
C. parvum CryptoDB 4.0 Cp.fa.gz Cp.dss.txt.gz Cp.tsq.txt.gz

File formats

The protein sequences are in FASTA format, while all the domain predictions are in HMMER3 domain tabular format (with additional columns in the case of the stratified predictions; see the DomStratStats page for more info).
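As an illustration of reading the domain tabular format, here is a minimal Python sketch. It assumes that the first 22 whitespace-separated fields follow the standard HMMER3 per-domain table layout, and that with hmmscan the target is the Pfam family and the query is the protein. Since the additional stratified columns are not described here, everything after those 22 fields is kept unparsed. The names `DomainHit` and `parse_domtab` are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class DomainHit(NamedTuple):
    family: str       # target name (the Pfam family when produced by hmmscan)
    accession: str    # target accession
    protein: str      # query name (the protein sequence)
    i_evalue: float   # independent E-value of this domain
    score: float      # bit score of this domain
    env_from: int     # envelope start on the protein
    env_to: int       # envelope end on the protein
    extra: List[str]  # anything after the 22 standard fields, left unparsed


def parse_domtab(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Parse lines of an HMMER3 per-domain table, skipping comments and blanks."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 22:
            raise ValueError("expected at least 22 fields, got %d" % len(fields))
        yield DomainHit(
            family=fields[0],
            accession=fields[1],
            protein=fields[3],
            i_evalue=float(fields[12]),
            score=float(fields[13]),
            env_from=int(fields[19]),
            env_to=int(fields[20]),
            extra=fields[22:],
        )
```

Splitting on whitespace also breaks the free-text description into separate tokens, which is why the trailing fields are returned as a list rather than interpreted.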


Pseudogenes were removed from the original protein sequence files. All domain predictions were obtained with HMMER 3.1b1, by comparing our protein sequences against the Pfam 27 database of domains. Additional statistics and thresholds were computed using DomStratStats 1.01. I used the following command, which pools statistics across organisms while keeping outputs separate.

# provide correct paths to hmmscan executable and Pfam files
# the input files ORG.fa or ORG.fa.gz must be in the same directory
perl -w hmmscan Pfam-A.hmm Pfam-A.hmm.dat \
	 Pf Pv Pk Py Pb Pc Bb Ta Tp Tg Nc Et Ch Cp Cm

The "Standard Pfam" predictions use the Pfam-curated "gathering" thresholds and only remove same-clan overlaps. The "Domain Stratified Statistics" predictions have no thresholds set (other than the mandatory domain overlap removal that precedes computing q-values and local FDRs). The "Tiered Stratified q-values" predictions have a per-tier q-value threshold of 1e-4 (again, in addition to the preceding domain overlap removal).
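To make the idea of same-clan overlap removal concrete, here is a generic greedy sketch in Python. It is not the Pfam or DomStratStats implementation. It assumes that hits on one protein are ranked by E-value, that any shared residue counts as an overlap, and that a family without a clan is treated as its own clan. The function names, the dictionary keys, and the `clan_of` mapping are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def remove_same_clan_overlaps(hits, clan_of):
    """Greedily keep the best-ranked hit among overlapping same-clan hits.

    hits: list of dicts with keys "family", "start", "end", "evalue",
          all on the same protein (hypothetical record layout).
    clan_of: dict mapping family name to clan id; families missing from
          the mapping are treated as their own clan (an assumption).
    """
    kept = []
    # Visit hits from most to least significant (lowest E-value first).
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        clan = clan_of.get(hit["family"], hit["family"])
        clash = any(
            clan_of.get(k["family"], k["family"]) == clan
            and overlaps(hit["start"], hit["end"], k["start"], k["end"])
            for k in kept
        )
        if not clash:
            kept.append(hit)
    # Report the surviving hits in order along the protein.
    return sorted(kept, key=lambda h: h["start"])
```

Under these assumptions, overlapping hits from different clans are both retained, which matches the description of removing only same-clan overlaps.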