Domain predictions for the Apicomplexa

The goal was to generate domain predictions for Plasmodium falciparum in particular. Predictions were generated for the rest of the Apicomplexa only to improve the q-value statistics, which are estimated across the whole set, because a single proteome has too little data to give good q-value estimates for most domain families (which are analyzed independently). For this reason, the P. falciparum proteome used is much newer than the other proteomes, which are only auxiliary. However, I include all the data for completeness and in the hope that the domain predictions for the other organisms are useful despite some of their sequences being outdated.

Download our raw domain data

All files are compressed with gzip and can be uncompressed on Windows with WinRAR. All files are plain text with Unix newlines (which may display incorrectly on Windows and Mac systems).

	Sequences		Domains
	Source	Data	Standard Pfam	Domain Stratified Statistics	Tiered Stratified q-values
P. falciparum	PlasmoDB 9.0	Pf.fa.gz	Pf.ga.txt.gz	Pf.dss.txt.gz	Pf.tsq.txt.gz
P. vivax	PlasmoDB 6.4	Pv.fa.gz	Pv.ga.txt.gz	Pv.dss.txt.gz	Pv.tsq.txt.gz
P. knowlesi	PlasmoDB 6.4	Pk.fa.gz	Pk.ga.txt.gz	Pk.dss.txt.gz	Pk.tsq.txt.gz
P. yoelii	PlasmoDB 6.4	Py.fa.gz	Py.ga.txt.gz	Py.dss.txt.gz	Py.tsq.txt.gz
P. chabaudi	GeneDB 2010-07	Pc.fa.gz	Pc.ga.txt.gz	Pc.dss.txt.gz	Pc.tsq.txt.gz
P. berghei	GeneDB 2010-07	Pb.fa.gz	Pb.ga.txt.gz	Pb.dss.txt.gz	Pb.tsq.txt.gz
B. bovis	UniProt 2010-07-21	Bb.fa.gz	Bb.ga.txt.gz	Bb.dss.txt.gz	Bb.tsq.txt.gz
T. annulata	UniProt 2010-07-21	Ta.fa.gz	Ta.ga.txt.gz	Ta.dss.txt.gz	Ta.tsq.txt.gz
T. parva	UniProt 2010-07-21	Tp.fa.gz	Tp.ga.txt.gz	Tp.dss.txt.gz	Tp.tsq.txt.gz
T. gondii	ToxoDB 7.0	Tg.fa.gz	Tg.ga.txt.gz	Tg.dss.txt.gz	Tg.tsq.txt.gz
N. caninum	ToxoDB 7.0	Nc.fa.gz	Nc.ga.txt.gz	Nc.dss.txt.gz	Nc.tsq.txt.gz
E. tenella	ToxoDB 7.0	Et.fa.gz	Et.ga.txt.gz	Et.dss.txt.gz	Et.tsq.txt.gz
C. hominis	CryptoDB 4.0	Ch.fa.gz	Ch.ga.txt.gz	Ch.dss.txt.gz	Ch.tsq.txt.gz
C. muris	CryptoDB 4.0	Cm.fa.gz	Cm.ga.txt.gz	Cm.dss.txt.gz	Cm.tsq.txt.gz
C. parvum	CryptoDB 4.0	Cp.fa.gz	Cp.ga.txt.gz	Cp.dss.txt.gz	Cp.tsq.txt.gz
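
The gzip-compressed files can also be read directly, without uncompressing them first. The following is a minimal Python sketch, assuming a sequence file such as Pf.fa.gz from the table above has been downloaded to the working directory. The function name count_fasta_records is hypothetical and only serves as an illustration.

```python
import gzip


def count_fasta_records(path):
    """Count the sequences in a gzip-compressed FASTA file.

    Each FASTA record starts with a header line beginning with '>'.
    """
    count = 0
    # "rt" reads the compressed stream as text; Python's newline handling
    # accepts the Unix line endings on any platform.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


# Example use (assumes Pf.fa.gz is in the current directory):
# print(count_fasta_records("Pf.fa.gz"))
```

Reading the files this way avoids the newline display problems mentioned above, since the text is never opened in an editor.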

File formats

The protein sequences are in FASTA format, while all the domain predictions are in HMMER3 domain tabular format (with additional columns in the case of the stratified predictions; see the DomStratStats page for more info).
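
As an illustration of the domain tabular format, here is a minimal Python sketch that reads the standard HMMER3 columns from one of the gzip-compressed prediction files. It assumes whitespace-separated fields in the usual HMMER3 order (22 fixed columns followed by a free-text description) and that comment lines start with "#". The additional columns of the stratified files are not interpreted here; consult the DomStratStats page for their meaning. The names DomainHit, parse_domtbl_line and read_domtbl are hypothetical.

```python
import gzip
from collections import namedtuple

# A subset of the standard HMMER3 domain table columns.
DomainHit = namedtuple(
    "DomainHit", "family protein i_evalue score env_from env_to"
)


def parse_domtbl_line(line):
    """Parse one line; return None for blank lines and '#' comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 22:
        raise ValueError("expected at least 22 columns, got %d" % len(fields))
    return DomainHit(
        family=fields[0],            # target name (the Pfam family for hmmscan)
        protein=fields[3],           # query name (the protein sequence)
        i_evalue=float(fields[12]),  # independent E-value of this domain
        score=float(fields[13]),     # bit score of this domain
        env_from=int(fields[19]),    # envelope start on the protein
        env_to=int(fields[20]),      # envelope end on the protein
    )


def read_domtbl(path):
    """Yield a DomainHit for every data line of a gzip-compressed table."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            hit = parse_domtbl_line(line)
            if hit is not None:
                yield hit


# Example use (assumes Pf.ga.txt.gz is in the current directory):
# for hit in read_domtbl("Pf.ga.txt.gz"):
#     print(hit.protein, hit.family, hit.env_from, hit.env_to)
```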

Methods

Pseudogenes were removed from the original protein sequence files. All domain predictions were generated with HMMER 3.1b1, comparing our protein sequences against the Pfam 27 database of domains. Additional statistics and thresholds were computed using DomStratStats 1.01. I used the following command, which pools statistics across organisms while keeping outputs separate.

# provide correct paths to the hmmscan executable and Pfam files
# the input files ORG.fa or ORG.fa.gz must be in the same directory
perl -w 4allManyOrgs.pl hmmscan Pfam-A.hmm Pfam-A.hmm.dat \
	Pf Pv Pk Py Pb Pc Bb Ta Tp Tg Nc Et Ch Cp Cm

The three sets of domain predictions differ in their thresholds. The "Standard Pfam" predictions use the Pfam-curated "gathering" thresholds and only remove same-clan overlaps. The "Domain Stratified Statistics" predictions have no thresholds set, other than the mandatory domain overlap removal that precedes computing q-values and local FDRs. The "Tiered Stratified q-values" predictions have a per-tier q-value threshold of 1e-4, again in addition to the preceding domain overlap removal.
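
To illustrate what domain overlap removal means, here is a generic greedy sketch in Python: the domains on one protein are ranked from most to least significant, and a domain is kept only if it does not overlap a domain that was already kept. This is an assumption-laden illustration and not necessarily the procedure implemented by Pfam or DomStratStats. In particular, it ignores clans, whereas the Standard Pfam predictions only remove overlaps between families of the same clan. The tuple layout and the function name remove_overlaps are hypothetical.

```python
def remove_overlaps(domains):
    """Greedy overlap removal for the domains of a single protein.

    Each domain is a (name, start, end, evalue) tuple with 1-based,
    inclusive coordinates. Returns the kept domains sorted by start.
    """
    kept = []
    # Visit domains from most significant (smallest E-value) to least.
    for dom in sorted(domains, key=lambda d: d[3]):
        _, start, end, _ = dom
        # Keep the domain only if it lies entirely before or entirely
        # after every domain that has already been kept.
        if all(end < k[1] or start > k[2] for k in kept):
            kept.append(dom)
    return sorted(kept, key=lambda d: d[1])


# Example: B overlaps the more significant A and C, so only A and C remain.
example = [("A", 1, 100, 1e-20), ("B", 50, 150, 1e-5), ("C", 120, 200, 1e-10)]
print([d[0] for d in remove_overlaps(example)])
```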

Citations

2015-11-17. Alejandro Ochoa, John D Storey, Manuel Llinás, Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. PLoS Comput Biol. 2015;11:e1004509. PubMed. PubMed Central. Article. arXiv 2014-09-23.
2016-01-27. Simon A Cobbold, Joana M Santos, Alejandro Ochoa, David H Perlman, Manuel Llinás. Proteome-wide analysis reveals widespread lysine acetylation of major protein complexes in the malaria parasite. Sci Rep. 2016;6:19722. PubMed. PubMed Central. Article.

Domain predictions for the Apicomplexa

using DomStratStats

by Alejandro Ochoa García
