dPUC 2

Domain Prediction Using Context

by Alejandro Ochoa García, Manuel Llinás, Mona Singh

Extend your Pfam predictions without loss of precision using domain context!


Here you can download our code, and learn how to use it. This page is the software's manual.

About dPUC 2

The dPUC (Domain Prediction Using Context) software improves domain prediction by considering every domain in the context of other domains, rather than independently as standard approaches do. Our specific framework maximizes the probability of the data assuming a naive Bayes approximation, which reduces to a pairwise context problem, and we have shown that our probabilistic method is indeed more powerful than competing methods.

The dPUC 2.xx series is a major update from dPUC 1.0. For the average user, what matters is that dPUC now works with HMMER3 and Pfam 24 onward, and that it is much faster than before. If you were a dPUC 1.0 user, you should know that inputs and outputs are completely different; the new setup is simpler due to the changes that HMMER3 and the newer Pfam releases have brought. Lastly, there are improvements in the context network parametrization, most notably the addition of directed context. See the release notes for more information.

Source code

Dpuc-2.02.tgz (79 KB)

Download our Perl source code, in a gzip-compressed tar archive file. It is a large project, with 13 packages and 3 scripts totalling 1843 lines of code (according to cloc).

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

This project is now on GitHub, where you can contribute and improve this software.

Previous versions

Code Dpuc-2.01.tgz (78 KB) and manual 2.01.

Code Dpuc-2.00.tgz (73 KB) and manual 2.00.

Beta code Dpuc-2.00b.tgz (192 KB), which was larger than subsequent releases because it contained sample files that are no longer included.

The original 1.00 release is available and documented on the dPUC 1.0 website.

Installation, dependencies

dPUC 2 requires Perl 5 and the Inline::C Perl package, which can be installed with cpan (which is included in standard Perl installations) by running the following command as the root user,

cpan Inline::C
and following its prompts (you may be asked to install additional packages). Do not hesitate to email me if the installation fails and you need further assistance.

This code also requires the lp_solve 5.5 library (but not the executable). In Fedora/RedHat Linux, you can install it using the following command as root,

yum install lpsolve-devel

Other than that, all you have to do is unpack the code and run the scripts from the directory that contains them. In this mode, the scripts and all packages should stay together in the same directory. You can also run my scripts from arbitrary directories.

You will also need HMMER3 (only the hmmpress and hmmscan executables are needed). Our code also assumes gzip is available in your system.
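Before running anything, it can help to confirm that the command-line dependencies listed above are visible on your PATH. The following is a minimal sketch, not part of the dPUC distribution; the function name missing_tools is hypothetical, and the check does not cover the lp_solve library, which is only detected when the C code is compiled.

```shell
# Print the name of every tool that is not found on the PATH.
# (missing_tools is a hypothetical helper, not a dPUC script.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools named in this manual; nothing is printed when all are present.
missing_tools perl cpan gzip hmmpress hmmscan

# Inline::C is a Perl module, so ask Perl whether it can load it.
perl -MInline::C -e 1 2>/dev/null || echo "Inline::C (Perl module)"
```

Any name printed by this sketch points to a dependency that still needs to be installed or added to the PATH.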

Compiling C code

Once the two outside dependencies have been installed correctly, you will need to compile the C portion of dPUC. This compilation will happen automatically the first time 1dpuc2.pl is run, so you may simply run

perl -w 1dpuc2.pl
which will take a little while, then show the "usage" message. If you see an error about not finding the lp_solve library, you may have to change the hardcoded path to the library in my package DpucLpSolve.pm (please email me about this or any other errors, I'd like to know about them).

Running the dPUC scripts

HMM databases

The current version of dPUC only works with Pfam. You must download Pfam-A.hmm.gz and Pfam-A.hmm.dat.gz from the Pfam FTP site. It will be necessary to use HMMER3's hmmpress on this HMM database before hmmscan can use it. The following commands achieve that.

# uncompress HMM database first 
gunzip Pfam-A.hmm.gz 

# prepare HMM database for searching with HMMER3 (which provides hmmpress) 
# four files will be generated, with names Pfam-A.hmm.h3{f,i,m,p} 
hmmpress Pfam-A.hmm 

# (optional) recompress the original HMM file (HMMER3 doesn't use it) 
gzip Pfam-A.hmm

You may keep these files compressed from now on; dPUC will read them correctly!
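If you want to confirm that hmmpress finished correctly, you can check for the four index files it generates. This is a small sketch under the assumption that the database is named Pfam-A.hmm and sits in the current directory; the function name missing_pressed_files is hypothetical.

```shell
# Print the name of each hmmpress index file that is missing
# for the given HMM database (hypothetical helper).
missing_pressed_files() {
    hmm=$1
    for ext in h3f h3i h3m h3p; do
        [ -e "$hmm.$ext" ] || echo "$hmm.$ext"
    done
}

# Prints nothing when all four index files exist.
missing_pressed_files Pfam-A.hmm
```

If any file name is printed, rerun hmmpress on the uncompressed Pfam-A.hmm.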

dPUC context count network

For completeness, my distribution includes a script that will generate the dPUC directed context network of domain family pair counts from Pfam-A.full.gz. However, downloading Pfam-A.full.gz takes a long time (in Pfam 28, that file is 13 GB!). For most users it makes more sense to download my precomputed network files, which I provide for every Pfam release starting with version 23.

Note that the context network file format changed with dPUC version 2.01. All files below are in the new format. The old format files are no longer available.

dpucNet.pfam28.txt.gz (1020 KB)

dpucNet.pfam27.txt.gz (696 KB)

dpucNet.pfam26.txt.gz (558 KB)

dpucNet.pfam25.txt.gz (357 KB)

dpucNet.pfam24.txt.gz (315 KB)

dpucNet.pfam23.txt.gz (223 KB)

Sample sequence file

To run through these examples, any FASTA format protein sequence file can be used. If you want, you can try this minimal sample file, sample.fa.gz, which consists of only 10 random proteins from the Plasmodium falciparum proteome. The two sample outputs are sample.28.txt.gz and sample.28.dpuc.txt.gz.

Sample outputs for these proteins were created using Pfam 28 and the 64-bit version of HMMER 3.1b1. Be aware that small differences in output can arise merely from using a 32-bit architecture. Additionally, HMMER 3.1b1 timestamps its output, so every run will appear to generate a different file when compared with hash functions such as SHA or MD5; do not use them for this purpose. It is better to compare the files line by line with a tool such as zdiff, which is like diff but for compressed files.
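As an alternative to reading zdiff output by eye, the comparison can be scripted so that run-specific lines are ignored. The sketch below assumes that the timestamp and similar details appear only on comment lines starting with '#', which you should verify on your own files; the function name same_predictions is hypothetical.

```shell
# Succeed when two tables (gzip-compressed or not) have identical
# non-comment lines. Assumption: run-specific details such as the
# timestamp appear only on lines that start with '#'.
same_predictions() {
    a=$(gzip -cdf "$1" | grep -v '^#')
    b=$(gzip -cdf "$2" | grep -v '^#')
    [ "$a" = "$b" ]
}

# Example: compare your own run against the provided sample output.
if [ -e sample.txt.gz ] && [ -e sample.28.txt.gz ]; then
    same_predictions sample.txt.gz sample.28.txt.gz \
        && echo "predictions match" \
        || echo "predictions differ"
fi
```

When the sketch reports a difference, zdiff on the same two files shows which lines changed.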

Synopsis of scripts

The following commands can be run on the sample file provided above, plus two Pfam files, all placed in the same directory as the code and called from that location. All input files may be compressed with gzip, and they may be specified with or without the GZ extension. Output files are compressed with gzip, whether or not the output names indicate a GZ extension. The commands as written below therefore work entirely with compressed files, without a single GZ extension indicated.

# create the dPUC context count network for your Pfam version 
# (skip this step by downloading above the precomputed dPUC network you need) 
perl -w dpucNet.pl Pfam-A.full dpucNet.txt 

# produce domain predictions with HMMER3 (provides hmmscan) with weak filters. 
# this is the slowest step 
perl -w 0runHmmscan.pl hmmscan Pfam-A.hmm sample.fa sample.txt 
# (optional) compress output, our scripts will read it either way 
gzip sample.txt 

# produce dPUC predictions with default parameters 
perl -w 1dpuc2.pl Pfam-A.hmm.dat dpucNet.txt sample.txt sample.dpuc.txt 1e-4 

# produce dPUC predictions for more than one p-value threshold for candidate domains 
# (produces one output for each threshold) 
perl -w 1dpuc2.pl Pfam-A.hmm.dat dpucNet.txt sample.txt sample.dpuc.txt 1e-1 1e-4

dpucNet.pl 1.00 - Extract the context count network from Pfam predictions

perl -w dpucNet.pl <Pfam-A.full> <dpucNet output>
The required inputs are the Pfam-A.full file and the path for the dPUC context network output.

This Pfam-specific script counts every ordered domain family pair observed in the full predictions of standard Pfam releases. The code is heavily optimized to parse such a large file quickly and to store the domain predictions with minimal memory usage. Unfortunately, Pfam-A.full lists domains grouped first by family, whereas these family pair counts can only be calculated if predictions are grouped first by protein, so all the data must be in memory before it can be rearranged. The run for Pfam 28 used 10.6 GB of RAM and took 20 minutes.

You will only need to run this command once per Pfam release. Moreover, you can skip this step altogether by downloading the precomputed dPUC context count networks I provide above (which I recommend, since downloading Pfam-A.full takes a very long time). This script is mostly provided for completeness and so the community can inspect it, find bugs, and otherwise improve it.
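To illustrate what counting ordered domain family pairs means, here is a toy sketch that works on a tiny hypothetical table with three columns (protein, domain start, family). It is not the dpucNet.pl implementation, it does not parse Pfam-A.full, and the real script may apply additional rules; it only shows why predictions must be grouped by protein before pairs can be counted.

```shell
# Toy sketch: count ordered family pairs (upstream family, downstream
# family) within each protein. Input columns are hypothetical:
# protein, domain start, family.
count_ordered_pairs() {
    # group by protein, then order domains by start position
    LC_ALL=C sort -k1,1 -k2,2n | awk '
        # count every ordered pair among the domains of one protein
        function count_pairs(    i, j) {
            for (i = 1; i <= n; i++)
                for (j = i + 1; j <= n; j++)
                    count[fam[i] " " fam[j]]++
            n = 0
        }
        $1 != prot { count_pairs(); prot = $1 }
        { fam[++n] = $3 }
        END {
            count_pairs()
            for (pair in count) print pair, count[pair]
        }
    ' | LC_ALL=C sort
}

# Two toy proteins; the rows of P2 are deliberately out of order.
printf 'P1 10 PF_A\nP1 50 PF_B\nP2 5 PF_A\nP2 80 PF_B\nP2 40 PF_C\n' |
    count_ordered_pairs
# prints:
# PF_A PF_B 2
# PF_A PF_C 1
# PF_C PF_B 1
```

The initial sort plays the role of the in-memory rearrangement described above: pairs can only be formed once all domains of a protein are adjacent.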

0runHmmscan.pl 1.01 - Get domain predictions from your protein sequences

perl -w 0runHmmscan.pl <hmmscan> <Pfam-A.hmm> <FASTA input> <output table>
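The required inputs are the path to the hmmscan executable, the Pfam-A.hmm database (already processed with hmmpress), the FASTA protein sequence input, and the path for the output table.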

This script doesn't simply run hmmscan with default parameters; it changes a few options that are important for dPUC to work. In particular, it makes sure outputs report \(p\)-values in place of \(E\)-values, and it relaxes the heuristic \(p\)-value filters to allow more predictions through. This script sets the most stringent \(p\)-value threshold at 1e-4, which matches the domain candidate \(p\)-value threshold that dPUC uses, and which worked best in our benchmarks.
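For readers curious how hmmscan can be made to report \(p\)-values, one known approach is to set the search-space sizes to 1 with the HMMER3 options -Z and --domZ, since an \(E\)-value is a \(p\)-value multiplied by the search-space size. The sketch below only prints such a command and does not run it. It is an assumption that 0runHmmscan.pl works this way; the function name and the output file name are hypothetical, and the filter options (--F1, --F2, --F3 in HMMER3) are left at their defaults here.

```shell
# Print (do not run) an hmmscan command whose E-values equal p-values.
# -Z 1 and --domZ 1 set the per-sequence and per-domain search-space
# sizes to 1. Assumption: this may differ from what 0runHmmscan.pl does.
hmmscan_pvalue_cmd() {
    hmm=$1 fasta=$2 out=$3
    echo "hmmscan -Z 1 --domZ 1 --domtblout $out $hmm $fasta"
}

hmmscan_pvalue_cmd Pfam-A.hmm sample.fa sample.domtbl.txt
# prints: hmmscan -Z 1 --domZ 1 --domtblout sample.domtbl.txt Pfam-A.hmm sample.fa
```

For dPUC itself, use 0runHmmscan.pl as documented above, since it also adjusts the filters that this sketch leaves untouched.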

1dpuc2.pl 1.01 - Produce dPUC domain predictions from raw hmmscan data

perl -w 1dpuc2.pl <Pfam-A.hmm.dat> <dpucNet> <input table> <output table> \
  [<p-value cuts...>]
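The required inputs are the Pfam-A.hmm.dat file, the dPUC context network, the input table produced by 0runHmmscan.pl, and the path for the output table.

The \(p\)-value thresholds on candidate domains are optional; one or more may be given.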

If more than one \(p\)-value threshold on candidate domains is provided, one output is produced for each threshold. Note that dPUC predictions are not necessarily nested as candidate domain thresholds are varied, since dPUC solves a complicated pairwise optimization problem, so these files are not redundant and are not obtained by filtering the largest file. Each output is as if dPUC had been run anew with the given \(p\)-value threshold, although the code does this efficiently by factoring out shared steps between thresholds.

Note that dPUC 2 has many options that are hardcoded in this script, and for simplicity they are not documented. They are experimental features, but most of them haven't been useful in my benchmarks. If you know what you're doing, feel free to modify this script to change these parameters.

This script uses a permissive overlap definition, in which only overlaps that exceed 40 amino acids or 50% of the length of the smaller domain are removed. These parameters are hardcoded in the script, modify them if you know what you're doing!
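The permissive overlap rule above can be sketched as a small test on two domain coordinate ranges. This is an illustration only: the two thresholds come from the text, but the function name, the inclusive 1-based coordinates, and the handling of exact boundary cases are assumptions rather than the code in 1dpuc2.pl.

```shell
# Exit status 0 when two domains conflict under the permissive rule.
# Arguments: start1 end1 start2 end2 (assumed 1-based, inclusive).
domains_conflict() {
    s1=$1 e1=$2 s2=$3 e2=$4
    max_aa=40                                # overlap limit in amino acids
    lo=$(( s1 > s2 ? s1 : s2 ))
    hi=$(( e1 < e2 ? e1 : e2 ))
    overlap=$(( hi - lo + 1 ))
    [ "$overlap" -gt 0 ] || return 1         # no overlap at all
    len1=$(( e1 - s1 + 1 ))
    len2=$(( e2 - s2 + 1 ))
    min_len=$(( len1 < len2 ? len1 : len2 ))
    # conflict if overlap exceeds 40 residues or 50% of the smaller domain
    [ "$overlap" -gt "$max_aa" ] || [ $(( 2 * overlap )) -gt "$min_len" ]
}

domains_conflict 1 100 90 200 && echo conflict || echo compatible  # prints: compatible
domains_conflict 1 100 50 200 && echo conflict || echo compatible  # prints: conflict
domains_conflict 1 20 10 100 && echo conflict || echo compatible   # prints: conflict
```

In the first call the 11-residue overlap is below both limits, in the second the 51-residue overlap exceeds 40 amino acids, and in the third the 11-residue overlap exceeds half of the 20-residue domain.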

Compatibility

Code was tested with HMMER versions 3.0 and 3.1b1, against the Pfam 25, 27, and 28 HMM databases. The code worked with Inline::C 0.53 and 0.62, and with lp_solve 5.5.2.0.

Notes shared with other projects

Please read some additional notes by following the next link.

My public code: a description of my original Perl packages, the union of code across the latest DomStratStats, dPUC, and RandProt distributions.

Citations

Alejandro Ochoa, Manuel Llinás, and Mona Singh. Using context to improve protein domain identification. BMC Bioinformatics 2011, 12:90. 2011-03-31.

Alejandro Ochoa, John Storey, Manuel Llinás, and Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. Submitted. arXiv 2014-09-23.

Release notes

2015-07-27 - dPUC 2.02

The main script now allows multiple \(p\)-value thresholds for candidate domains passed as an option on the command line. The script used to have a single threshold hardcoded, and did not handle outputs for multiple thresholds.

2015-07-23 - dPUC 2.01

The most important change is the addition of domain family checks to ensure the correct version of Pfam is being used throughout.

The code was tested with Pfam 28, which was released in May 2015. All the samples provided now correspond to that version. In testing, I found a bug in the Pfam-A.hmm.dat file, whose "nesting" associations between domain families were blank, and this caused my old Pfam-A.hmm.dat parser to complain to STDERR a lot. I adjusted my code so this does not happen anymore. Unfortunately, for the current Pfam 28 Pfam-A.hmm.dat, there is no domain family nesting information available for use, though I contacted the Pfam team and this bug may be fixed in the future (and my code will work correctly when that happens).

2014-09-18 - dPUC 2.00

This was the first public release of the second series of the code.

Compared to the 1.0 version, much has changed. In summary, the 2.xx series works with HMMER3/Pfam 24 (and newer), and it is highly optimized to reduce runtime. Importantly, the output format has changed (now it is simply like the HMMER3 domain tabular output files).

Compared to the beta version, many packages were reorganized, in general for the better. Also, I include code to generate the dPUC count network (the beta version simply provided the network for Pfam 24, and that network had strange filters that I have removed in this new public version).

2014-02-37 - dPUC 2.00 beta

I prepared this version for Juliana Bernardes, but until I posted it with the final version 2.00, I believe nobody else had seen this version, so I do not consider it a public version.

2011-02-06 - dPUC 1.0

Last change recorded on the public source code. See the dPUC (1.0) website and our paper for more information.
