DomStratStats

stratified statistics for protein domains

by Alejandro Ochoa García, John D Storey, Manuel Llinás, Mona Singh

Compute domain q-values and local FDRs, and tiered q-values (combining HMMER3 domain and sequence data).

Thumbnail
VIIIA

en-us es-mx - - Email Me

Here you can download our code, and learn how to use it. This page is the software's manual.

About DomStratStats

The goal of this software is to bring better statistics to protein sequence analysis. The standard approaches are based on p-values and E-values, but here we introduce q-values and local FDRs (False Discovery Rates) for protein domains. They key to make q-values and local FDRs work better is by stratifying, in this case analyzing each domain family separately, which best balances noise across domain families.

Practically, the "domain q-value" approach will be the most useful. Theoretically, we proved in our work that "local FDRs" are optimal; however, we found in benchmarks that local FDR estimates do not perform as well as q-values because some HMMER3 p-values are bad (for certain repetitive domain families, in particular). We found that q-values are much more robust than local FDRs given the p-values that HMMER3 gives. Our code provides local FDRs anyway for research purposes, and with the hope that they can be better used in the future, when domain p-values improve.

Lastly, we included an even more experimental mode, "tiered q-values". Our tiered analysis combines the HMMER sequence and domain p-values into a joint assessment of significance based on the FDR. On the plus side, it produces even more predictions at a fixed combined FDR than the previous approaches. However, the theoretical FDR is too far off of the empirical FDR, which means that you should exercise caution choosing which threshold to use.

Source code

/dl.png DomStratStats-1.00.tgz (35 KB)

Download our Perl source code, in a gzip-compressed tar archive file. It is fairly small, 6 packages and 4 scripts totalling 953 lines of code (according to cloc).

Installation

Our perl code is self-contained, so only Perl 5 is required (no external perl packages are needed). All you have to do is unpack the code and run the scripts from the directory that contains them. In this mode, the scripts and all packages should stay together in the same directory.

If you want to run the scripts from other directories, let's say you placed our code in /myPath, you have at least two options. One is to edit the source code, you'll have to find the lines in the script and package headers ' use lib "." ' and replace "." with "/myPath" (absolute paths work best). Alternatively, you can call the perl executable with option -I/myPath (put it just before or after -w in the examples below) so perl can find the packages. See this article for more alternatives.

You will also need HMMER3 (only the hmmpress and hmmscan executables are needed).

Running the DomStratStats scripts

Recommendations

The scripts do not come with default thresholds for q-values or local FDRs. Different thresholds will allow for different balances of true and false positives. If you want predictions that are comparable in quality to Pfam, our benchmarks suggest a q-value threshold of 4e-4 (most recommended), or a local FDR threshold of 2.5e-2 (local FDRs are less recommended in general), or a tiered q-value threshold of 1e-4 (for each tier, but FDRs are inaccurate in this mode). But in some ways Pfam is too conservative, so if you prefer you can set more lenient thresholds. For example, q≤1e-3 should give one false positive for every 1000 domains (although the true FDR might be higher if your sequences contain coiled coils, transmembrane domains, or other low-complexity regions, so watch out!). coiled coils, transmembrane domains, or other low-complexity regions, so watch out!).

The FDRs and local FDRs we compute become more accurate when datasets are larger. In practice, we found that the proteins from a single organism are not enough to give optimal results, but grouping many related organisms performs very well (for example, I usually work with 15 organisms of the phylum Apicomplexa). The examples below break this rule, they only use Plasmodium falciparum for the sake of having small inputs and outputs. In general, it is better to start from a larger collection of proteins that represent the diversity of the domains contained in the organism you're studying (such as the Apicomplexa, in the case of P. falciparum, as I did for this other project). Local FDRs are much more sensitive than q-values to the size of the input data.

HMM databases

If you choose Pfam, which is what we use in our examples, you should download Pfam-A.hmm.gz from the Pfam FTP site. It will be necessary to use HMMER3's hmmpress on this HMM database before hmmscan can use it. The following commands achieve that.

# uncompress HMM database first 
gunzip Pfam-A.hmm.gz 

# prepare HMM database for searching with HMMER3 (which provides hmmpress) 
# four files will be generated, with names Pfam-A.hmm.h3{f,i,m,p} 
hmmpress Pfam-A.hmm 

# (optional) recompress the original HMM file (HMMER3 doesn't use it) 
gzip Pfam-A.hmm

For Pfam, it is optional but recommended to download Pfam-A.hmm.dat.gz as well, it helps our code recognize nested domains. You may keep this file compressed, DomStratStats will read it correctly!

Another popular HMM database is Superfamily. You will need to register to obtain the file hmmlib_1.75.gz. You can use it exactly the same way as Pfam-A.hmm.gz, but you will also need to run hmmpress on it. There is no Superfamily equivalent to Pfam-A.hmm.dat.gz, so unfortunately nested domains will not be handled correctly. There are some important ways in which Superfamily HMMs differ from Pfam; however, our code does not treat Superfamily specially. In all cases, we stratify the q-value and local FDR analysis by HMM.

Sample sequence file

To run through these examples, any FASTA format protein sequence file can be used. If you want, you can try this file, PLAF7.fa.gz, which is the Plasmodium falciparum proteome from PlasmoDB version 11.0, with pseudogenes removed, it contains 5441 proteins and is 2.3MB in size (compressed).

All outputs for these proteins are in this sample/ directory, using Pfam 27 and Superfamily 1.75, and the 64-bit version of HMMER 3.1b1. No sample files are included with our source code. Small differences in output arise even from using a 32-bit architecture, be aware of that. Additionally, HMMER 3.1b1 timestamps the output, so every run will appear to generate a different file if compared with hash functions such as SHA or MD5, so don't use them. It is better to compare each line with a tool such as zdiff, which is like diff but for compressed files.

Synopsis of scripts

The following commands can be run on the sample file provided above, plus two Pfam files, all placed in the same directory as the code and called from that location. All input files may be compressed with gzip and they may be specified with or without the GZ extension. Output files are compressed with gzip, whether the outputs indicate a GZ extension or not. So the commands as they are below produce and will work entirely with compressed files, without a single GZ extension indicated.

# produce domain predictions with HMMER3 (provides hmmscan) with weak filters. 
# this is the slowest step 
perl -w 0runHmmscan.pl hmmscan Pfam-A.hmm PLAF7.fa PLAF7.p.txt 
# (optional) compress output, our scripts will read it either way 
gzip PLAF7.p.txt 

# remove overlaps, ranking by p-value, creating an intermediate output file 
perl -w 1noOvs.pl PLAF7.p.txt PLAF7.po.txt Pfam-A.hmm.dat 

# add domain q-value, local FDR, and FDR|lFDR columns (no thresholds set) 
perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.txt 
#... and set a q-value threshold 
perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.q4e-4.txt 4e-4 
#... or set a local FDR threshold 
perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.l2.5e-2.txt 1 2.5e-2 
#... or set an FDR threshold via equal local FDR thresholds 
perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.fl4e-4.txt 1 1 4e-4 

# tiered q-values (for sequence and domain p-values), with mandatory threshold set 
perl -w 3tieredStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pot.1e-4.txt 1e-4 
#... or threshold set to sequence q-values only 
perl -w 3tieredStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pot.1e-4,1.txt 1e-4 1

All examples were also run for Superfamily 1.75 (outputs are PLAF7.s* rather than PLAF7.p*), and all output files are available in the sample/ directory.

0runHmmscan.pl - Get domain predictions from your protein sequences

perl -w 0runHmmscan.pl <hmmscan> <Pfam-A.hmm> <FASTA input> <output table>
The required inputs are

This script doesn't simply use hmmscan with default parameters, in fact it changes a few options that are important for DomStratStats to work. In particular, it makes sure outputs report p-values in place of E-values, and it relaxes the heuristic p-value filters to allow more predictions through. If we desire predictions with q<4e-4 on UniRef50, we found that we must keep p-values as large as 1e-5 (for most families the effective p-value threshold is much smaller, though). This script sets the most stringent p-value threshold at 1e-4 to ensure that nothing is missed by the domain stratified q-value analysis.

Note that the HMMER3 heuristic filters were not optimized in the context of tiered q-values, it is possible that they can be modified for that use in particular (for example, by not setting a final domain p-value threshold in HMMER3, which currently equals the sequence p-value threshold of 1e-4).

1noOvs.pl - Removes overlapping domains ranking by p-value

perl -w 1noOvs.pl <input table> <output table> [<Pfam-A.hmm.dat>]
The required inputs are

This is an additional optional input, which is specific to Pfam.

We found that the easiest way to compute correct FDRs and local FDRs is to first remove overlaps between domains, raking by p-value (in the absence of a better starting statistic, this performs very well). We assume that overlap removal will be performed ultimately between the best predictions, which is common practice between predictions of the same Pfam clan or superfamily. This scripts removes overlaps between all families, not relying on curated relationships between HMMs. It performs excellently if the ultimate threshold is reasonable (for example, q<4e-4), but this step might lead to overly conservative statistics (particularly local FDRs appear much worse than they really are) for much weaker predictions.

This script uses a permissive overlap definition, in which only overlaps that exceed 40 amino acids or 50% of the length of the smaller domain are removed. These parameters are hardcoded in the script, modify them if you know what you're doing!

2domStratStats.pl - Computes and adds domain q-values, local FDRs, and FDR|lFDR

perl -w 2domStratStats.pl <FASTA input> <input table> <output table> \
  [<q-value> <local FDR> <FDR|lFDR>]
The required inputs are

The following optional thresholds may be set. Either or all can be set simultaneously, which is supported although it is unusual and hard to interpret, so avoid using this feature. Use 1 for any threshold you don't wish to set but must list (for example, to only set an FDR|lFDR threshold, set both the q-value and local FDR thresholds to 1).

This is the most basic approach to FDR control. If you're not researching domain predictions, and just want domain predictions at an FDR other than what Pfam gives you, set a q-value threshold (q≤4e-4 is best). Regard local FDRs and FDR|lFDR as experimental features worthy of further research, at least for the moment.

Note that the FDR|lFDR is monotonic with the stratified domain local FDR (but not with stratified domain q-values). When local FDRs are accurate (this is a big if!), they provide the optimal way of controlling the combined FDR through FDR|lFDR.

3tieredStratStats.pl - Computes and adds q-values for sequences and domains

perl -w 3tieredStratStats.pl <FASTA input> <input table> <output table> \
  <seq q-value> [<dom q-value>]
The required inputs are

You can optionally specify a domain q-value threshold.

This "tiered" approach relies on setting a threshold on stratified sequence q-values, then computing q-values on the remaining domains (a q-value which gives an FDR conditional on the first sequence FDR filter) and setting a threshold too. This way the information of repeating domains can be used, like Pfam does except here thresholds are picked automatically rather than through expert curation. However, as I mentioned above, this approach does not control the FDR accurately, so q-value thresholds must be chosen using benchmarks.

Description of original Perl packages

To those brave souls that want to look at the source code, here's a brief description of what each package does. I apologize that my code is not as well organized as I'd like it to be.

Citation

Alejandro Ochoa, John Storey, Manuel Llinás, and Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. In preparation.

Compatibility

Code was tested with HMMER versions 3.0 and 3.1b1, against the Pfam 25 and 27, and Superfamily 1.75 HMM databases. This code was only tested with Perl 5.18 (but should work for any version ≥5). The code is expected to work on any Linux and MacOS, but let me know otherwise.

Want a Windows version?

The code will not work on Windows machines because it uses the gzip executable and it also uses Unix pipes. However, if you install the PerlIO::gzip package, or forgo working with compressed files, the code could run (with some adjustments, contact me for more info).

License

Copyright © 2014 Alejandro Ochoa, Singh Research Group, Princeton University

This file is part of DomStratStats.

DomStratStats is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

DomStratStats is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with DomStratStats. If not, see http://www.gnu.org/licenses/.

Release notes

2014-06-23 - DomStratStats 1.00

This was the first public release of the code.

VIIIA

History