The goal was to generate domain predictions for Plasmodium falciparum in particular, and predictions were generated on the rest of the Apicomplexa only to improve the q-value statistics, which are estimated across the set. This is because a single proteome has too little data to give good q-value estimates for most domain families (which are analyzed independently). For this reason, the P. falciparum proteome used is much newer than the rest of the proteomes, which are only auxiliary. However, I include all the data for completeness and in the hope that the domain predictions on the other organisms are useful despite some of their sequences being outdated.
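To make the role of pooling concrete, here is a minimal sketch of how q-values can be estimated from a pooled list of p-values for one domain family. It uses the generic Benjamini-Hochberg estimator, which is an assumption made for illustration and not necessarily what DomStratStats implements. The function name `bh_qvalues` and the p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the same order as the input.

    Generic textbook estimator, shown only for illustration.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum
    # of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical p-values for one domain family in two organisms.
organism_a = [0.01, 0.04]
organism_b = [0.03, 0.5]

# Pooling means the estimator sees all four values at once.
pooled = organism_a + organism_b
qvalues = bh_qvalues(pooled)
print(qvalues)
```

With only a handful of predictions per family in a single proteome, the ranks used in such an estimate are very coarse, which is why the statistics are pooled across organisms.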
All files are compressed with gzip, and can be uncompressed in Windows with WinRAR. All files are plain text files with Unix newlines (which may display incorrectly in some Windows and older Mac text editors).
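The files can also be read directly, without uncompressing them first. The following is a minimal sketch in Python, assuming a gzip-compressed FASTA file. The function name `read_fasta` and the demo file are hypothetical, and the demo sequences are made up so that the example is self-contained.

```python
import gzip
import os
import tempfile

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, gzipped or not."""
    opener = gzip.open if path.endswith(".gz") else open
    # Text mode translates newlines, so Unix line endings are read
    # correctly on any platform.
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

# Write a tiny made-up file (not real data) to demonstrate the reader.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fa.gz")
with gzip.open(demo_path, "wt", newline="\n") as out:
    out.write(">protA example\nMKV\nLLA\n>protB\nMSS\n")

records = list(read_fasta(demo_path))
print(records)
```

To use it on a downloaded file, pass that file's path to `read_fasta` instead of `demo_path`.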
The protein sequences are in FASTA format, while all the domain predictions are in HMMER3 domain tabular format (with additional columns in the case of the stratified predictions; see the DomStratStats page for more info).
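As a rough sketch of how a line of the domain tabular format might be parsed, the example below splits on whitespace and picks out a few of the 22 standard HMMER3 columns. The `DomainHit` type, the function name, and the demo line are hypothetical, and the demo line is made up rather than taken from the data. Everything after column 22 (the target description, plus any additional columns in the stratified files) is kept as uninterpreted tokens, since their layout is described on the DomStratStats page and is not assumed here.

```python
from typing import List, NamedTuple, Optional

class DomainHit(NamedTuple):
    family: str      # column 1: target name (the Pfam family for hmmscan)
    accession: str   # column 2: target accession
    protein: str     # column 4: query name (the protein for hmmscan)
    i_evalue: float  # column 13: independent E-value of this domain
    score: float     # column 14: bit score of this domain
    env_from: int    # column 20: envelope start on the protein
    env_to: int      # column 21: envelope end on the protein
    rest: List[str]  # tokens after column 22, left uninterpreted

def parse_domtbl_line(line: str) -> Optional[DomainHit]:
    """Parse one line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    f = line.split()
    if len(f) < 22:
        raise ValueError("expected at least 22 columns, got %d" % len(f))
    return DomainHit(f[0], f[1], f[3], float(f[12]), float(f[13]),
                     int(f[19]), int(f[20]), f[22:])

# Made-up line for illustration only (not a real prediction).
demo = ("FamX PF99999.1 100 protA - 250 1e-30 100.5 0.1 1 1 "
        "2e-33 1e-30 100.0 0.1 1 100 10 110 5 115 0.95 Illustrative family")
hit = parse_domtbl_line(demo)
print(hit)
```

Reading whole files then amounts to calling `parse_domtbl_line` on each line and skipping the `None` results.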
Pseudogenes were removed from the original protein sequence files. All domain predictions were found using the HMMER 3.1b1 software, by comparing our protein sequences against the Pfam 27 database of domains. Additional statistics and thresholds were computed using DomStratStats 1.01. I used the following command, which pools statistics across organisms while keeping outputs separate:
# provide correct paths to hmmscan executable and Pfam files
# the input files ORG.fa or ORG.fa.gz must be in the same directory
perl -w 4allManyOrgs.pl hmmscan Pfam-A.hmm Pfam-A.hmm.dat \
Pf Pv Pk Py Pb Pc Bb Ta Tp Tg Nc Et Ch Cp Cm
The "Standard Pfam" predictions use the Pfam-curated "gathering" thresholds and only remove same-clan overlaps. The "Domain Stratified Statistics" predictions have no thresholds set (other than the mandatory domain overlap removal that precedes computing q-values and local FDRs). The "Tiered Stratified q-values" predictions have a per-tier q-value threshold of 1e-4 (again, in addition to the preceding domain overlap removal).