# dPUC 2

## Domain Prediction Using Context

### by Alejandro Ochoa García, Manuel Llinás, Mona Singh

Extend your Pfam predictions without loss of precision using domain context!


The dPUC (Domain Prediction Using Context) software improves domain prediction by considering every domain in the context of other domains, rather than independently as standard approaches do. Our framework maximizes the probability of the data under a naive Bayes approximation, which reduces to a pairwise context problem, and we have shown that this probabilistic method is indeed more powerful than competing methods.

The dPUC 2.xx series is a major update from dPUC 1.0. For the average user, what matters is that now dPUC works with HMMER 3 and Pfam 24 onward, and dPUC is much faster than before. If you were a dPUC 1.0 user, you should know that inputs and outputs are completely different; the new setup is simpler due to the changes that HMMER3 and the new Pfams have brought. Lastly, there are improvements in the context network parametrization, most notably the addition of directed context. See the release notes for more information.

## Source code

Dpuc-2.00.tgz (73 KB)

Download our Perl source code, in a gzip-compressed tar archive file. It is a large project, with 13 packages and 3 scripts totalling 1746 lines of code (according to cloc).

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

This project is now on GitHub, where you can contribute and improve this software.

### Previous versions

The beta release, Dpuc-2.00b.tgz (192 KB), is larger than subsequent releases because it contained sample files that are no longer included.

The original 1.00 release is available and documented on the dPUC 1.0 website.

## Installation, dependencies

dPUC 2 requires Perl 5 and the Inline::C Perl package. The latter can be installed with cpan (which is included in standard Perl installations) by running the following command as the root user,

```bash
cpan Inline::C
```

and proceeding with its instructions (you may be asked to install additional packages). Do not hesitate to email me if the installation fails and you need further assistance.

This code also requires the lp_solve 5.5 library (but not the executable). On Fedora/RedHat Linux, you can install it by running the following command as root:

```bash
yum install lpsolve-devel
```

Other than that, all you have to do is unpack the code and run the scripts from the directory that contains them. The scripts and all packages should stay together in the same directory, but you can invoke the scripts from arbitrary working directories.

You will also need HMMER3 (only the hmmpress and hmmscan executables are needed). Our code also assumes gzip is available in your system.

## Compiling C code

Once the two outside dependencies have been installed correctly, you will need to compile the C portion of dPUC. This compilation happens automatically the first time 1dpuc2.pl is run, so you may simply run

```bash
perl -w 1dpuc2.pl
```

which will take a little while, then show the "usage" message. If you see an error about the lp_solve library not being found, you may have to change the hardcoded path to the library in the DpucLpSolve.pm package (please email me about this or any other errors; I'd like to know about them).

## Running the dPUC scripts

### HMM databases

The current version of dPUC only works with Pfam. You must download Pfam-A.hmm.gz and Pfam-A.hmm.dat.gz from the Pfam FTP site. It will be necessary to use HMMER3's hmmpress on this HMM database before hmmscan can use it. The following commands achieve that.

```bash
# uncompress the HMM database first
gunzip Pfam-A.hmm.gz

# prepare the HMM database for searching with HMMER3 (which provides hmmpress)
# four files will be generated, with names Pfam-A.hmm.h3{f,i,m,p}
hmmpress Pfam-A.hmm

# (optional) recompress the original HMM file (HMMER3 doesn't use it)
gzip Pfam-A.hmm
```

You may keep these files compressed from now on; dPUC will read them correctly!

### dPUC context count network

For completeness, my distribution includes a script that will generate the dPUC directed context network of domain family pair counts from Pfam-A.full.gz. However, downloading Pfam-A.full.gz takes a long time (in Pfam 27, that file is 4.1 GB!). For most users it makes more sense to download my precomputed network files, which I provide for every Pfam release starting with version 23.

dpucNet.pfam27.txt.gz (659 KB)

dpucNet.pfam26.txt.gz (524 KB)

dpucNet.pfam25.txt.gz (327 KB)

dpucNet.pfam24.txt.gz (286 KB)

dpucNet.pfam23.txt.gz (198 KB)

### Sample sequence file

To run through these examples, any FASTA format protein sequence file can be used. If you want, you can try this minimal sample file, sample.fa.gz, which consists of only 10 random proteins from the Plasmodium falciparum proteome. The two sample outputs are sample.27.txt.gz and sample.27.dpuc.txt.gz.

Sample outputs for these proteins were created using Pfam 27 and the 64-bit version of HMMER 3.1b1. Be aware that small differences in output arise even from using a 32-bit architecture. Additionally, HMMER 3.1b1 timestamps its output, so every run appears to generate a different file when compared with hash functions such as SHA or MD5; don't use them. It is better to compare line by line with a tool such as zdiff, which is like diff but for gzip-compressed files.

### Synopsis of scripts

The following commands can be run on the sample file provided above, plus the two Pfam files, all placed in the same directory as the code and called from that location. All input files may be compressed with gzip, and they may be specified with or without the .gz extension. Output files are compressed with gzip whether or not the output names indicate a .gz extension. So the commands below, although they show no .gz extensions, produce and work entirely with compressed files.

```bash
# create the dPUC context count network for your Pfam version
perl -w dpucNet.pl Pfam-A.full dpucNet.txt

# produce domain predictions with HMMER3 (provides hmmscan) with weak filters.
# this is the slowest step
perl -w 0runHmmscan.pl hmmscan Pfam-A.hmm sample.fa sample.txt
# (optional) compress the output; our scripts will read it either way
gzip sample.txt

# produce dPUC predictions with default parameters
perl -w 1dpuc2.pl Pfam-A.hmm.dat dpucNet.txt sample.txt sample.dpuc.txt
```

### dpucNet.pl 1.00 - Extract the context count network from Pfam predictions

```bash
perl -w dpucNet.pl <Pfam-A.full> <dpucNet output>
```

The required inputs are

• Pfam-A.full: Input "full" domain prediction file from Pfam.
• dpucNet output: Directed context network of domain family pair counts.

This Pfam-specific script counts every ordered domain family pair observed in the full predictions of standard Pfam releases. The code is heavily optimized to parse such a large file quickly and to store the domain predictions with minimal memory usage (unfortunately, Pfam-A.full lists domains grouped first by family, whereas these family pair counts can only be calculated if predictions are grouped first by protein, so all the data must be in memory before it can be rearranged). The run for Pfam 27 used 3.2 GB of RAM, but it completes quickly (~5 minutes).
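The ordered pair counting described above can be sketched as follows (a minimal Python illustration, not the actual Perl implementation; it assumes per-protein domain architectures are already in memory):

```python
from collections import Counter
from itertools import combinations

def count_pairs(architectures):
    """Count every ordered domain family pair within each protein.

    `architectures` maps protein ID -> list of domain families in
    sequence order along the protein.
    """
    net = Counter()
    for families in architectures.values():
        # combinations() preserves left-to-right order, giving directed pairs
        for a, b in combinations(families, 2):
            net[(a, b)] += 1
    return net

# toy example: architecture A-A-B contributes two A->B counts and one A->A count
print(count_pairs({"P1": ["A", "A", "B"]}))
```

Note that every co-occurrence is counted, matching the release notes below: the architecture A-A-B yields two A-B counts plus one A-A count, rather than one count per pair per protein.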

You will only need to run this command once per Pfam release. Moreover, you can skip this step altogether by downloading the precomputed dPUC context count networks I provide above (which I recommend, since downloading Pfam-A.full takes a very long time). This script is mostly provided for completeness and so the community can inspect it, find bugs, and otherwise improve it.

### 0runHmmscan.pl 1.01 - Get domain predictions from your protein sequences

```bash
perl -w 0runHmmscan.pl <hmmscan> <Pfam-A.hmm> <FASTA input> <output table>
```

The required inputs are

• hmmscan: the path to the HMMER3 hmmscan executable.
• Pfam-A.hmm: the path to your HMM library of choice (in HMMER3 format).
• FASTA input: the FASTA sequence file, may be compressed with gzip.
• output table: the hmmscan output, a plain-text table delimited by whitespace (always uncompressed).

This script doesn't simply run hmmscan with default parameters; it changes a few options that are important for dPUC to work. In particular, it makes sure outputs report $$p$$-values in place of $$E$$-values, and it relaxes the heuristic $$p$$-value filters to allow more predictions through. The most stringent $$p$$-value threshold is set at 1e-4, which matches the domain candidate $$p$$-value threshold that dPUC uses and which worked best in our benchmarks.
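For illustration, the relationship between the two statistics can be sketched as below (this assumes HMMER's documented convention that the E-value equals the p-value multiplied by the number of profiles searched, Z; the function names are hypothetical):

```python
def evalue_to_pvalue(evalue, num_models):
    # HMMER reports E = p * Z, where Z is the number of profiles searched,
    # so the per-domain p-value is recovered as E / Z
    return evalue / num_models

def passes_candidate_filter(pvalue, threshold=1e-4):
    # dPUC's candidate threshold is p <= 1e-4, independent of database size
    return pvalue <= threshold

p = evalue_to_pvalue(0.1, 10000)  # E = 0.1 against 10,000 profiles
print(p, passes_candidate_filter(p))
```

This is why a p-value threshold, unlike an E-value threshold, does not change meaning when the HMM database grows or shrinks.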

### 1dpuc2.pl 1.00 - Produce dPUC domain predictions from raw hmmscan data.

```bash
perl -w 1dpuc2.pl <Pfam-A.hmm.dat> <dpucNet> <input table> <output table>
```

The required inputs are

• Pfam-A.hmm.dat: Pfam annotation file, contains GA thresholds and nesting network.
• dpucNet: Directed context network of domain family pair counts.
• input table: the output from hmmscan (previous script).
• output table: input with most domains removed by dPUC. Format is identical to input.

It is your responsibility to ensure that the same Pfam version is used for Pfam-A.hmm.dat, the dPUC context network, and the input table.

Note that dPUC 2 has many options that are hardcoded in this script, and for simplicity they are not documented. They are experimental features, but most of them haven't been useful in my benchmarks. If you know what you're doing, feel free to modify this script to change these parameters.

This script uses a permissive overlap definition, in which only overlaps that exceed 40 amino acids or 50% of the length of the smaller domain are removed. These parameters are hardcoded in the script, modify them if you know what you're doing!
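A minimal sketch of this permissive overlap test (the 40 amino acid and 50% thresholds come from the text above; the exact coordinate conventions in the script may differ):

```python
def overlap_conflicts(start1, end1, start2, end2, max_aa=40, max_frac=0.5):
    # inclusive coordinates; length of the intersection of the two domains
    overlap = min(end1, end2) - max(start1, start2) + 1
    if overlap <= 0:
        return False  # disjoint domains never conflict
    shorter = min(end1 - start1 + 1, end2 - start2 + 1)
    # permissive: only overlaps exceeding 40 aa or 50% of the shorter
    # domain count as conflicts; smaller overlaps are tolerated
    return overlap > max_aa or overlap > max_frac * shorter

# a 6 aa overlap between two ~100 aa domains is tolerated
print(overlap_conflicts(1, 100, 95, 200))
```

The permissiveness matters because many genuine Pfam domains overlap slightly at their boundaries, and removing every overlap would discard true predictions.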

## Compatibility

Code was tested with HMMER versions 3.0 and 3.1b1, against the Pfam 25 and 27 HMM databases. The code worked with Inline::C 0.53 and 0.62, and with lp_solve 5.5.2.0.

## Notes shared with other projects

My public code, description of my original Perl packages
The union of code across the latest DomStratStats, dPUC, and RandProt distributions.

## Citations

Alejandro Ochoa, Manuel Llinás, and Mona Singh. Using context to improve protein domain identification. BMC Bioinformatics 2011, 12:90. 2011-03-31. Pubmed Article

Alejandro Ochoa, John Storey, Manuel Llinás, and Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. Submitted. arXiv 2014-09-23.

## Release notes

### 2014-09-18 - dPUC 2.00

This was the first public release of the second series of the code.

Compared to the 1.0 version, much has changed. In summary, the 2.xx series works with HMMER3/Pfam 24 (and newer), and it is very highly optimized to reduce runtime. Importantly, the output format has changed (it is now simply the HMMER3 domain tabular output format).

• Installation. It's easier now.
• The previous version had more dependencies, particularly more external Perl packages. The previous version required the executable lp_solve, whereas the current version requires the lp_solve library instead (note some installers include one but not the other). The new version requires the gzip executable to be available.
• The previous version included sample files with the source code, while the new version has source code separate from sample files (which are available on this web manual).
• The previous version required modifying a package, FilePaths.pm, to specify the paths of required Pfam and other files. The new version does not require any such source code modification to work (unless the lp_solve library is in an unusual place, I believe that is the only exception). All required files are now passed as arguments on the command line.
• Pipeline. It's simplified.
• The new version first runs HMMER3 on all proteins, then uses dPUC to filter those results as a separate step. The previous version would take each protein, run HMMER2 on it in both old modes (ls and fs, which no longer exist), then parse that result and run dPUC on it. The biggest disadvantage of the old way was I/O: the constant back and forth of writing very small files (single sequences) to disk for HMMER, parsing the output back in dPUC, and iterating. Also, since HMMER3 is now multithreaded while dPUC is not, it may make more sense to run HMMER3 separately, perhaps on different machines with more cores.
• The previous version included the standard Pfam and dPUC predictions in the same output. The new version only shows dPUC. The new output format is exactly the HMMER3 domain tabular format (the input is simply filtered by dPUC, without altering any of the original columns or adding information). The previous version's output had more information, but it was hard to interpret, so it is currently omitted in favor of simplicity (this could change in the future).
• dPUC now has many hidden parameters that are not accessible on the command line. The previous version allowed the selection of an $$E$$-value threshold for candidate domains, but the new script has the equivalent threshold hardcoded. Again, this is in favor of simplicity. All parameters can be modified by modifying the 1dpuc2.pl script.
• The previous version required parsing large Pfam files before the first run, or use the files provided with the distribution. This is no longer required now that Pfam-A.hmm.dat exists (it wasn't available when dPUC 1.0 came out; Pfam-A.hmm.dat is so small now dPUC parses it at load time).
• There is no longer a daemon version of dPUC. The previous version used it for my website to generate predictions, but that functionality is also no longer available on the website (dPUC2 is so much faster and easier to use, that I expect people to be happier running it in their own machines).
• Parametrization. The current dpucNet.pl differs in important ways from the equivalent procedure in dPUC 1.0 (which was three separate scripts). These changes also affect how dPUC uses context scores for prediction. They allow for greater flexibility and also lead to modest improvements in statistical power.
• Domain family pairs are now ordered, or in other words, the network is directed. At load time, dPUC can treat the network as undirected with a score parameter that computes weighted averages of the counts of each direction (usage of this and other score parameters is currently not documented). The default is to keep the network fully directed.
• The previous procedure would output bitscores instead of raw counts. The new approach computes the bitscores when dPUC is loaded, it is very fast and provides the most flexibility since arbitrary scoring parameters can be tried immediately (instead of generating new network files for each parameter set).
• The counts are now of every domain family pair, rather than pairs counted only once per protein (so a protein with domain architecture A-A-B used to contribute a single A-B count, whereas now it contributes two A-B counts, in both cases also contributing an A-A count).
• The previous procedure applied an architecture count filter, whereas the new procedure outputs pair counts without any filters. The new approach allows for filtering the context network by pair count at dPUC load time, again providing maximum flexibility. We found that removing pair counts of 1 (the current default behavior) performs better than the previous architecture count filtering, presumably it removes more false positive context pairs per true positive context pair removed. The difference can be understood conceptually for a domain family pair that is observed in one architecture that was only observed once, but which contains such a family pair multiple times (the previous approach would remove such a family pair, the new approach will keep it and treat it as having higher confidence depending on the count).
• Other aspects of the score parametrization have changed or been extended. For example, the prior count distribution may be more complex, and the shape of the count and score distributions can be altered by exponentiating, shifting, and scaling. However, in my benchmarks I haven't found useful ways of wielding these parameters, so by default they are not used. The default prior "pseudocount" is the same for every domain family pair, now fixed at $$2^{-23}$$ (about 1e-7) times the number of family pair counts observed (which is comparable to the prior of the previous version). The new parametrization makes it clear that unobserved domain pairs will have a score of approximately -23 bits.
• To agree with my other project, DomStratStats, dPUC now assumes $$p$$-values, rather than $$E$$-values, are present in the input from HMMER3. The default threshold for candidate domains is $$p \le \mbox{1e-4}$$, which is comparable to the previous recommendation with the exception that now the threshold is independent of the size of the database used.
• Runtime performance. In my benchmarks, dPUC 2 is ~16x faster than its predecessor (excluding HMMER runtime, although HMMER itself reportedly became ~100x faster from version 2.3.2 to 3.0).
• The way dPUC interacts with lp_solve has been completely revamped. The previous approach generated a text file for every protein and threshold, which was the input for the lp_solve executable, the output of which was parsed by dPUC and processed to become the final domain predictions. Now dPUC and lp_solve interact more directly using the lp_solve C library. The previous approach was expensive not just because of the I/O overhead, but also because each lp_solve run required starting a new process via a process fork, which is also more expensive than the newest solution.
• The dPUC constraints used for lp_solve were also optimized. Although there are now more statements, they are stronger constraints that make it more likely that lp_solve will either find the exact integer solution when the continuous relaxation is solved, or it will converge more quickly to the integer solution at the branch-and-bound steps.
• dPUC now detects many trivial cases that do not need to be sent to lp_solve. The most useful one is the case when, after the dPUC positive elimination, the domains that are left do not overlap and do not have negative context with each other, in which case they constitute the final solution (in the previous approach, they were sent to lp_solve anyway, which would solve the problem much more slowly, especially if too many domains are present). Detecting trivial problems also scales much better than solving them with lp_solve (worst-case time complexities of $$O(m^2)$$ versus $$O(2^m)$$ for $$m$$ domains).
• There are two hidden "pre-filter" options that nobody should use. The idea was to speed up dPUC by removing overlaps in the input domains, either using $$p$$-value ranking or using one of my extensions of CODD (also using $$p$$-value ranking but additionally requiring strictly positive directed context between all predictions, always using the top-ranking domain as the only trusted domain). Both forms make dPUC faster but lead to considerable loss of statistical power; the second version in particular is terrible. I may remove these pre-filters completely in the future.
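The trivial-case detection mentioned in the notes above can be sketched as follows (a hedged illustration; the pairwise predicates are hypothetical stand-ins for dPUC's internal overlap and context tests):

```python
from itertools import combinations

def is_trivial(domains, overlap_conflicts, negative_context):
    # If no remaining domain pair conflicts by overlap or has negative
    # context, the remaining domains already form the final solution and
    # the integer program can be skipped entirely.  This check examines
    # O(m^2) pairs for m domains, versus O(2^m) worst case for the solver.
    for a, b in combinations(domains, 2):
        if overlap_conflicts(a, b) or negative_context(a, b):
            return False
    return True

# example with dummy predicates: three mutually compatible domains
never = lambda a, b: False
print(is_trivial(["A", "B", "C"], never, never))
```

When the check succeeds, no lp_solve call is made at all, which is why this shortcut dominates the runtime savings on proteins with few, well-separated domains.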

Compared to the beta version, many packages were reorganized, in general for the better. Also, I include code to generate the dPUC count network (the beta version simply provided the network for Pfam 24, and that network had strange filters that I have removed in this new public version).

### 2014-02-37 - dPUC 2.00 beta

I prepared this version for Juliana Bernardes, but until I posted it alongside the final version 2.00, I believe nobody else had seen it, so I do not consider it a public release.

### 2011-02-06 - dPUC 1.0

Last change recorded on the public source code. See the dPUC (1.0) website and our paper for more information.
