# DomStratStats

## stratified statistics for protein domains

### by Alejandro Ochoa García, John D Storey, Manuel Llinás, Mona Singh

Compute domain q-values and local FDRs, and tiered q-values (combining HMMER3 domain and sequence data).

VIIIA

- -

Here you can download our code, and learn how to use it. This page is the software's manual.

The goal of this software is to bring better statistics to protein sequence analysis. The standard approaches are based on $$p$$-values and $$E$$-values, but here we introduce $$q$$-values and lFDRs (local False Discovery Rates) for protein domains. They key to make $$q$$-values and lFDRs work better is by stratifying, in this case analyzing each domain family separately, which best balances noise across domain families.

Practically, the "domain $$q$$-value" approach will be the most useful. Theoretically, we proved in our work that lFDRs are optimal; however, we found in benchmarks that lFDR estimates do not perform as well as $$q$$-values because some HMMER3 $$p$$-values are bad (for certain repetitive domain families, in particular). We found that $$q$$-values are much more robust than lFDRs given the $$p$$-values that HMMER3 gives. Our code provides lFDRs anyway for research purposes, and with the hope that they can be better used in the future, when domain $$p$$-values improve.

Lastly, we included an even more experimental mode, "tiered $$q$$-values". Our tiered analysis combines the HMMER sequence and domain $$p$$-values into a joint assessment of significance based on the FDR. On the plus side, it produces even more predictions at a fixed combined FDR than the previous approaches. However, the theoretical FDR is too far off of the empirical FDR, which means that you should exercise caution choosing which threshold to use.

## Source code

DomStratStats-1.01.tgz (46 KB)

Download our Perl source code, in a gzip-compressed tar archive file. It has 8 packages and 5 scripts totalling 1328 lines of code (according to cloc).

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

### Previous versions

Code DomStratStats-1.00.tgz (35 KB) and manual 1.00.

## Installation

Our perl code is self-contained, so only Perl 5 is required (no external perl packages are needed). All you have to do is unpack the code and run the scripts from the directory that contains them. In this mode, the scripts and all packages should stay together in the same directory. You can also run my scripts from arbitrary directories.

You will also need HMMER3 (only the hmmpress and hmmscan executables are needed). Our code also assumes gzip is available in your system.

## Running the DomStratStats scripts

### Recommendations

The scripts do not come with default thresholds for $$q$$-values or lFDRs. Different thresholds will allow for different balances of true and false positives. If you want predictions that are comparable in quality to Pfam, our benchmarks suggest a $$q$$-value threshold of 4e-4 (most recommended), or an lFDR threshold of 2.5e-2 (lFDRs are less recommended in general), or a tiered $$q$$-value threshold of 1e-4 (for each tier, but FDRs are inaccurate in this mode). But in some ways Pfam is too conservative, so if you prefer you can set more lenient thresholds. For example, $$q \le \mbox{1e-3}$$ should give one false positive for every 1000 domains (although the true FDR might be higher if your sequences contain coiled coils, transmembrane domains, or other low-complexity regions, so watch out!).

The FDRs and lFDRs we compute become more accurate when datasets are larger. In practice, we found that the proteins from a single organism are not enough to give optimal results, but grouping many related organisms performs very well (for example, I usually work with 15 organisms of the phylum Apicomplexa). The examples below break this rule, they only use Plasmodium falciparum for the sake of having small inputs and outputs. In general, it is better to start from a larger collection of proteins that represent the diversity of the domains contained in the organism you're studying (such as the Apicomplexa, in the case of P. falciparum, as I did for this other project). The lFDRs are much more sensitive than $$q$$-values to the size of the input data.

### HMM databases

If you choose Pfam, which is what we use in our examples, you should download Pfam-A.hmm.gz from the Pfam FTP site. It will be necessary to use HMMER3's hmmpress on this HMM database before hmmscan can use it. The following commands achieve that.

# uncompress HMM database first gunzip Pfam-A.hmm.gz # prepare HMM database for searching with HMMER3 (which provides hmmpress) # four files will be generated, with names Pfam-A.hmm.h3{f,i,m,p} hmmpress Pfam-A.hmm # (optional) recompress the original HMM file (HMMER3 doesn't use it) gzip Pfam-A.hmm

For Pfam, it is optional but recommended to download Pfam-A.hmm.dat.gz as well, it helps our code recognize nested domains. You may keep this file compressed, DomStratStats will read it correctly!

Another popular HMM database is Superfamily. You will need to register to obtain the file hmmlib_1.75.gz. You can use it exactly the same way as Pfam-A.hmm.gz, but you will also need to run hmmpress on it. There is no Superfamily equivalent to Pfam-A.hmm.dat.gz, so unfortunately nested domains will not be handled correctly. There are some important ways in which Superfamily HMMs differ from Pfam; however, our code does not treat Superfamily specially. In all cases, we stratify the $$q$$-value and lFDR analysis by HMM.

### Sample sequence file

To run through these examples, any FASTA format protein sequence file can be used. If you want, you can try this file, PLAF7.fa.gz, which is the Plasmodium falciparum proteome from PlasmoDB version 11.0, with pseudogenes removed, it contains 5441 proteins and is 2.3MB in size (compressed).

All outputs for these proteins are listed below using Pfam 27 and Superfamily 1.75, and the 64-bit version of HMMER 3.1b1. No sample files are included with our source code. Small differences in output arise even from using a 32-bit architecture, be aware of that. Additionally, HMMER 3.1b1 timestamps the output, so every run will appear to generate a different file if compared with hash functions such as SHA or MD5, so don't use them. It is better to compare each line with a tool such as zdiff, which is like diff but for compressed files.

 Description Pfam 27 Superfamily 1.75 Raw HMMER3 output PLAF7.p.txt.gz PLAF7.s.txt.gz HMMER3 output without overlaps PLAF7.po.txt.gz PLAF7.so.txt.gz domStratStats output without thresholds PLAF7.pod.txt.gz PLAF7.sod.txt.gz domStratStats output, $$q \le \mbox{4e-4}$$ PLAF7.pod.q4e-4.txt.gz PLAF7.sod.q4e-4.txt.gz domStratStats output, $$\mbox{lFDR} \le \mbox{2.5e-2}$$ PLAF7.pod.l2.5e-2.txt.gz PLAF7.sod.l2.5e-2.txt.gz domStratStats output, $$\mbox{FDR|lFDR} \le \mbox{4e-4}$$ PLAF7.pod.fl4e-4.txt.gz PLAF7.sod.fl4e-4.txt.gz tieredStratQ output, $$q_{seq},q_{dom|seq} \le \mbox{1e-4}$$ PLAF7.pot.1e-4.txt.gz PLAF7.sot.1e-4.txt.gz tieredStratQ output, $$q_{seq} \le \mbox{1e-4}$$ (no $$q_{dom|seq}$$ threshold) PLAF7.pot.1e-4,1.txt.gz PLAF7.sot.1e-4,1.txt.gz

The only exception is the sample data for script 4allManyOrgs.pl, which is a much larger dataset from a different project, which was only run on Pfam 27.

### Synopsis of scripts

The following commands can be run on the sample file provided above, plus two Pfam files, all placed in the same directory as the code and called from that location. All input files may be compressed with gzip and they may be specified with or without the GZ extension. Output files are compressed with gzip, whether the outputs indicate a GZ extension or not. So the commands as they are below produce and will work entirely with compressed files, without a single GZ extension indicated.

# produce domain predictions with HMMER3 (provides hmmscan) with weak filters. # this is the slowest step perl -w 0runHmmscan.pl hmmscan Pfam-A.hmm PLAF7.fa PLAF7.p.txt # (optional) compress output, our scripts will read it either way gzip PLAF7.p.txt # remove overlaps, ranking by p-value, creating an intermediate output file perl -w 1noOvs.pl PLAF7.p.txt PLAF7.po.txt Pfam-A.hmm.dat # add domain q-value, local FDR, and FDR|lFDR columns (no thresholds set) perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.txt #... and set a q-value threshold perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.q4e-4.txt 4e-4 #... or set a local FDR threshold perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.l2.5e-2.txt 1 2.5e-2 #... or set an FDR threshold via equal local FDR thresholds perl -w 2domStratStats.pl PLAF7.fa PLAF7.po.txt PLAF7.pod.fl4e-4.txt 1 1 4e-4 # tiered q-values (for sequence and domain p-values), with mandatory threshold set perl -w 3tieredStratQ.pl PLAF7.fa PLAF7.po.txt PLAF7.pot.1e-4.txt 1e-4 #... or threshold set to sequence q-values only perl -w 3tieredStratQ.pl PLAF7.fa PLAF7.po.txt PLAF7.pot.1e-4,1.txt 1e-4 1 # run entire pipeline on multiple organisms, pooling their stats # pass prefixes ORG instead of ORG.fa or ORG.fa.gz # outputs follow patterns (see below) perl -w 4allManyOrgs.pl hmmscan Pfam-A.hmm Pfam-A.hmm.dat \ Pf Pv Pk Py Pb Pc Bb Ta Tp Tg Nc Et Ch Cp Cm

All examples were also run for Superfamily 1.75 (outputs are PLAF7.s* rather than PLAF7.p*), and all output files are available above. The only exception is the data for 4allManyOrgs.pl (see above).

### 0runHmmscan.pl 1.01 - Get domain predictions from your protein sequences

perl -w 0runHmmscan.pl <hmmscan> <Pfam-A.hmm> <FASTA input> <output table>
The required inputs are

• hmmscan: the path to the HMMER3 hmmscan executable.
• Pfam-A.hmm: the path to your HMM library of choice (in HMMER3 format).
• FASTA input: the FASTA sequence file, may be compressed with gzip.
• output table: the hmmscan output is plain text table delimited by whitespace (always uncompressed).

This script doesn't simply run hmmscan with default parameters, in fact it changes a few options that are important for DomStratStats to work. In particular, it makes sure outputs report $$p$$-values in place of $$E$$-values, and it relaxes the heuristic $$p$$-value filters to allow more predictions through. If we desire predictions with $$q \le \mbox{4e-4}$$ on UniRef50, we found that we must keep $$p$$-values as large as 1e-5 (however, for most families the effective $$p$$-value threshold is much smaller). This script sets the most stringent $$p$$-value threshold at 1e-4 to ensure that nothing is missed by the domain stratified $$q$$-value analysis.

Note that the HMMER3 heuristic filters were not optimized in the context of tiered $$q$$-values, it is possible that they can be modified for that use in particular (for example, by not setting a final domain $$p$$-value threshold in HMMER3, which currently equals the sequence $$p$$-value threshold of 1e-4).

### 1noOvs.pl 1.01 - Removes overlapping domains ranking by $$p$$-value

perl -w 1noOvs.pl <input table> <output table> [<Pfam-A.hmm.dat>]
The required inputs are

• input table: the output from hmmscan (previous script).
• output table: input with some domains removed. Format is identical to input.

This is an additional optional input, which is specific to Pfam.

• Pfam-A.hmm.dat: only used to identify domains that may nest.

We found that the easiest way to compute correct FDRs and lFDRs is to first remove overlaps between domains, raking by $$p$$-value (in the absence of a better starting statistic, this performs very well). We assume that overlap removal will be performed ultimately between the best predictions, which is common practice between predictions of the same Pfam clan or superfamily. This scripts removes overlaps between all families, not relying on curated relationships between HMMs. It performs excellently if the ultimate threshold is reasonable (for example, $$q \le \mbox{4e-4}$$), but this step might lead to overly conservative statistics (particularly lFDRs appear much worse than they really are) for much weaker predictions.

This script uses a permissive overlap definition, in which only overlaps that exceed 40 amino acids or 50% of the length of the smaller domain are removed. These parameters are hardcoded in the script, modify them if you know what you're doing!

### 2domStratStats.pl 1.00 - Computes and adds domain $$q$$-values, lFDRs, and FDR|lFDR

perl -w 2domStratStats.pl <FASTA input> <input table> <output table> \ [<q-value> <local FDR> <FDR|lFDR>] 
The required inputs are

• FASTA input: the FASTA sequence file, used to count all proteins.
• input table: the output from 1noOvs.pl (previous script).
• output table: every input line has three columns added (unless thresolds are set).

The following optional thresholds may be set. Either or all can be set simultaneously, which is supported although it is unusual and hard to interpret, so avoid using this feature. Use 1 for any threshold you don't wish to set but must list (for example, to only set an FDR|lFDR threshold, set both the $$q$$-value and lFDR thresholds to 1).

• $$q$$-value: stratified domain $$q$$-value threshold. Sets FDR per family.
• local FDR: stratified domain lFDR threshold. Sets minimum posterior error probability per family.
• FDR|lFDR: threshold on combined FDR across domain families implied by a stratified domain lFDR threshold. Sets a stratified domain lFDR threshold indirectly, by specifying the resulting combined FDR instead.

This is the most basic approach to FDR control. If you're not researching domain predictions, and just want domain predictions at an FDR equal to what Pfam gives you, set a $$q$$-value threshold ($$q \le \mbox{4e-4}$$ is best). Regard lFDRs and FDR|lFDR as experimental features worthy of further research, at least for the moment.

Note that the FDR|lFDR is monotonic with the stratified domain lFDR, but not with stratified domain $$q$$-values. When lFDRs are accurate (this is a big if!), they provide the optimal way of controlling the combined FDR through FDR|lFDR.

### 3tieredStratQ.pl 1.00 - Computes and adds $$q$$-values for sequences and domains

perl -w 3tieredStratQ.pl <FASTA input> <input table> <output table> \ <seq q-value> [<dom q-value>] 
The required inputs are

• FASTA input: the FASTA sequence file, used to count all proteins.
• input table: the output from 1noOvs.pl (previous script).
• output table: every input line that passes thresholds has two columns added.
• seq $$q$$-value: stratified sequence $$q$$-value threshold. Sets the desired FDR at the sequence level, which can be interpreted as the FDR of calling proteins as having at least one member of a given domain family. This threshold is mandatory because they define conditional domain $$q$$-values. If no domain $$q$$-value threshold is set, then this same value is applied to conditional domain $$q$$-values, in which case the domain FDR (unconditionally) is approximately twice the $$q$$-value threshold indicated.

You can optionally specify a domain $$q$$-value threshold that differs from the sequence $$q$$-value threshold.

• dom $$q$$-value: threshold on the stratified domain $$q$$-value computed conditionally on the sequence threshold. The expected domain FDR (unconditionally) is approximately the sum of the sequence and domain $$q$$-value thresholds.

This "tiered" approach relies on setting a threshold on stratified sequence $$q$$-values, then computing $$q$$-values on the remaining domains (a $$q$$-value which gives an FDR conditional on the first sequence FDR filter) and setting a threshold too. This way the information of repeating domains can be used, like Pfam does except here thresholds are picked automatically rather than through expert curation. However, as I mentioned above, this approach does not control the FDR accurately, so $$q$$-value thresholds must be chosen using benchmarks.

### 4allManyOrgs.pl 1.00 - Get final domain predictions from multiple sequence files

perl -w 4allManyOrgs.pl <hmmscan> <Pfam-A.hmm> <Pfam-A.hmm.dat> <orgs>...
The required inputs are

• hmmscan: the path to the HMMER3 hmmscan executable.
• Pfam-A.hmm: the path to your HMM library of choice (in HMMER3 format).
• Pfam-A.hmm.dat: Pfam-specific annotations (GA thresholds, clans, nesting).
• orgs...: list of prefixes to identify inputs and outputs (see below).

This complex script runs the entire pipeline on several organisms, producing three kinds of outputs that one may wish to compare, namely the Standard Pfam (which use the GA thresholds and other processing), the Domain Stratified Statistics (without thresholds), and the Tiered Stratified $$q$$-values (with a hardcoded threshold of 1e-4 per tier).

For each prefix ORG provided, the input is assumed to be ORG.fa or ORG.fa.gz (either will be read correctly), and the outputs will be (all compressed with gzip by default, GZ extension ommited below)

• ORG.txt: Raw HMMER3 output with p-values and using permissive thresholds
• ORG.noOvs.txt: HMMER3 output with overlaps removed (in the permissive sense)
• ORG.dss.txt: Domain Stratified Statistics output, adds 3 columns
• ORG.tsq.txt: Tiered Stratified $$q$$-values output, filtered and adds 2 columns
• ORG.ga.txt: Standard Pfam output, with 'GA' thresholds and intra-clan overlaps removed

Note that, while most steps are run independently per organism, all stratified statistics are estimated by pooling the $$p$$-value distributions across organisms (always stratified by domain family), which are usually better than single-organism estimates. Moreover, the outputs remain separated by organism, which may be more convenient than simply merging the initial proteomes into a single sequence file (which produces a single output for all organisms).

This is the most inflexible of all these scripts, but it showcases the intended use of the API, which is currently not documented, particularly the use of multiple input files. I hope you modify it to suit your needs, for example to change the output patterns, or if you only want the domain predictions with a lFDR threshold, or if you want intermediate files to be deleted upon completion.

## Compatibility

Code was tested with HMMER versions 3.0 and 3.1b1, against the Pfam 25 and 27, and Superfamily 1.75 HMM databases.

## Notes shared with other projects

My public code, description of my original Perl packages
The union of code across the latest DomStratStats, dPUC, and RandProt distributions.

## Citation

Alejandro Ochoa, John Storey, Manuel Llinás, and Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. Submitted. arXiv 2014-09-23.

## Release notes

### 2014-09-18 - DomStratStats 1.01

Added script 4allManyOrgs.pl

3tieredStratStats.pl was renamed to 3tieredStratQ.pl, since this is a more accurate description (this script does not use lFDRs, only $$q$$-values).

Reorganized internal packages, including creating and renaming packages and functions, for additional clarity to anybody interested in using the undocumented API. I also added functions not used by DomStratStats, in order to ensure package compatibility with related projects (particularly dPUC 2)

Corrected documentation errors. Also made this manual more concise.

Outputs are unchanged, since no algorithms changed.

### 2014-06-23 - DomStratStats 1.00

This was the first public release of the code.