dPUC 2

Domain Prediction Using Context

by Alejandro Ochoa García, Manuel Llinás, Mona Singh

Extend your Pfam predictions without loss of precision using domain context!


Here you can download our code, and learn how to use it. This page is the software's manual.

About dPUC 2

The dPUC (Domain Prediction Using Context) software improves domain prediction by considering every domain in the context of other domains, rather than independently as standard approaches do. Our framework maximizes the probability of the data under an approximation, which reduces to a pairwise context problem, and we have shown that our probabilistic method is indeed more powerful than competing methods.

The dPUC 2.xx series is a major update from dPUC 1.0. For the average user, what matters is that dPUC now works with HMMER 3 and Pfam 24 onward, and is much faster than before. If you were a dPUC 1.0 user, you should know that inputs and outputs are completely different; the new setup is simpler due to the changes that HMMER 3 and the new Pfam releases have brought. Lastly, there are improvements in the context network parametrization, most notably the addition of directed context. See the release notes for more information.

Source code

Dpuc-2.06.tgz (75 KB)

Download our Perl source code, in a gzip-compressed tar archive file. It is a large project, with 12 packages and 3 scripts totalling 1753 lines of code (according to cloc).

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

This project is now on GitHub, where you can contribute and improve this software.

Previous versions

Code Dpuc-2.05.tgz (74 KB) and manual 2.05.

Code Dpuc-2.04.tgz (73 KB) and manual 2.04.

Code Dpuc-2.03.tgz (73 KB) and manual 2.03.

Code Dpuc-2.02.tgz (79 KB) and manual 2.02.

Code Dpuc-2.01.tgz (78 KB) and manual 2.01.

Code Dpuc-2.00.tgz (73 KB) and manual 2.00.

Beta code Dpuc-2.00b.tgz (192 KB), which was larger than subsequent releases because it contained sample files that are no longer included.

The original 1.00 release is available and documented on the dPUC 1.0 website.

Installation, dependencies

dPUC 2 requires Perl 5 and the Inline::C Perl package. You can install the package with cpan (which is included in standard Perl installations) by running the following command as the root user:

cpan Inline::C

Then follow cpan's instructions (you may be asked to install additional packages). Do not hesitate to email me if the installation fails and you need further assistance.

This code also requires the lp_solve 5.5 library (but not the executable). On Fedora/RedHat Linux, you can install it by running the following command as root:

yum install lpsolve-devel

Other than that, all you have to do is unpack the code and run the scripts from the directory that contains them. In this mode, the scripts and all packages should stay together in the same directory. You can also run my scripts from arbitrary directories.

You will also need HMMER3 (only the hmmpress and hmmscan executables are needed). Our code also assumes that gzip is available on your system.
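
Before going further, it can help to confirm that the external commands named above are on your PATH. The following is only a minimal sketch and is not part of the dPUC distribution; the function name missing_tools is hypothetical.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is NOT on PATH.
# (Hypothetical helper for illustration; it is not shipped with dPUC.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands this manual asks for; an empty report means all were found.
missing=$(missing_tools perl gzip hmmpress hmmscan)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found on PATH:" $missing
fi
```

The Inline::C module can be checked in a similar spirit with perl -e 'require Inline::C', which exits with an error if Perl cannot load the module. The lp_solve library is not a command, so this sketch does not cover it.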

Compiling C code

Once the two outside dependencies have been installed correctly, you will need to compile the C portion of dPUC. This compilation happens automatically the first time 1dpuc2.pl is run, so you may simply run

perl -w 1dpuc2.pl

which will take a little while and then show the "usage" message. If you see an error about not finding the lp_solve library, you may have to change the hardcoded path to the library in my package DpucLpSolve.pm. Please email me about this or any other errors; I'd like to know about them.

Running the dPUC scripts

HMM databases

The current version of dPUC only works with Pfam. You must download Pfam-A.hmm.gz and Pfam-A.hmm.dat.gz from the Pfam FTP site. It will be necessary to use HMMER3's hmmpress on this HMM database before hmmscan can use it. The following commands achieve that.

# uncompress HMM database first 
gunzip Pfam-A.hmm.gz 

# prepare HMM database for searching with HMMER3 (which provides hmmpress) 
# four files will be generated, with names Pfam-A.hmm.h3{f,i,m,p} 
hmmpress Pfam-A.hmm 

# (optional) recompress the original HMM file (HMMER3 doesn't use it) 
gzip Pfam-A.hmm

You may keep these files compressed from now on; dPUC will read them correctly!
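
To confirm that hmmpress produced the four index files mentioned in the comments above, a check along the following lines may be useful. This is a sketch under the assumption that the index files sit next to the database and are non-empty; the function name is_pressed is hypothetical and not part of dPUC or HMMER.

```shell
#!/bin/sh
# Succeed only if all four hmmpress index files exist and are non-empty.
# Argument: the HMM database path WITHOUT any .gz suffix, e.g. Pfam-A.hmm
# (hypothetical helper for illustration; it is not shipped with dPUC)
is_pressed() {
    for ext in h3f h3i h3m h3p; do
        [ -s "$1.$ext" ] || return 1
    done
    return 0
}

# Usage: report on the database in the current directory.
if is_pressed Pfam-A.hmm; then
    echo "Pfam-A.hmm has been pressed"
else
    echo "Pfam-A.hmm is missing at least one .h3{f,i,m,p} file; run hmmpress"
fi
```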

dPUC context count network

For completeness, my distribution includes a script that will generate the dPUC directed context network of domain family pair counts from Pfam-A.full.uniprot.gz for Pfam 29-31 (Pfam-A.full.gz for Pfam 23-28). However, downloading Pfam-A.full.uniprot.gz takes a long time (in Pfam 31, that file is 9.1 GB!). For most users it makes more sense to download my precomputed network files, which I provide for every Pfam release starting with version 23.

Note that the context network file format changed with dPUC version 2.01. All files below are in the new format. The old format files are no longer available.

dpucNet.pfam31.txt.gz (1.2 MB)

dpucNet.pfam30.txt.gz (925 KB)

dpucNet.pfam29.txt.gz (842 KB)

dpucNet.pfam28.txt.gz (718 KB)

dpucNet.pfam27.txt.gz (503 KB)

dpucNet.pfam26.txt.gz (409 KB)

dpucNet.pfam25.txt.gz (271 KB)

dpucNet.pfam24.txt.gz (242 KB)

dpucNet.pfam23.txt.gz (175 KB)

These files are also available on a GitHub repository, dpuc2-data.

Sample sequence file

To run through these examples, any FASTA format protein sequence file can be used. If you want, you can try this minimal sample file, sample.fa.gz, which consists of only 10 random proteins from the Plasmodium falciparum proteome. The two sample outputs are sample.31.txt.gz and sample.31.dpuc.txt.gz.

Sample outputs for these proteins were created using Pfam 31 and the 64-bit version of HMMER 3.1b2. Be aware that small differences in output can arise merely from using a 32-bit architecture. Additionally, HMMER 3.1b2 timestamps its output, so every run will appear to produce a different file when compared with hash functions such as SHA or MD5; do not rely on them. It is better to compare the files line by line with a tool such as zdiff, which is like diff but for compressed files.
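
If you would rather have a yes/no answer than read a diff, one option is to compare only the data lines. The sketch below assumes that every run-specific line (such as the timestamp) starts with '#', as the comment lines of HMMER tabular outputs do; the function name same_data_lines is hypothetical. It holds both files in shell variables, so it is meant for small files such as the samples.

```shell
#!/bin/sh
# Compare two gzip-compressed tables while ignoring '#' comment lines.
# ASSUMPTION: the run-specific lines (such as the timestamp) all start with '#'.
# (hypothetical helper for illustration; it is not shipped with dPUC)
same_data_lines() {
    a=$(gzip -dc "$1" | grep -v '^#')
    b=$(gzip -dc "$2" | grep -v '^#')
    [ "$a" = "$b" ]
}

# Usage (commented out because it needs your own output file):
# same_data_lines sample.31.txt.gz my.sample.txt.gz && echo "data lines match"
```

When the function reports a difference, zdiff on the same two files shows which lines changed.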

Synopsis of scripts

The following commands can be run on the sample file provided above plus two Pfam files, with all files placed in the same directory as the code and the commands called from that location. All input files may be compressed with gzip, and they may be specified with or without the GZ extension. Output files are compressed with gzip whether or not the output name includes a GZ extension. The commands below therefore work entirely with compressed files, even though no GZ extension appears in any of them.
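
To make the "with or without the GZ extension" convention concrete, here is one plausible reading of it as a small lookup. This is an illustration only: how the dPUC scripts actually resolve file names is not shown here, and the function name resolve_gz is hypothetical.

```shell
#!/bin/sh
# Print the path that exists on disk: the name as given, the name with .gz
# appended, or the name with a trailing .gz removed (tried in that order).
# Returns non-zero if none of these exist.
# (hypothetical helper illustrating the convention; not dPUC's own code)
resolve_gz() {
    if [ -f "$1" ]; then
        printf '%s\n' "$1"
    elif [ -f "$1.gz" ]; then
        printf '%s\n' "$1.gz"
    elif [ -f "${1%.gz}" ]; then
        printf '%s\n' "${1%.gz}"
    else
        return 1
    fi
}

# Usage (commented out because it needs your own files):
# resolve_gz sample.txt    # prints sample.txt.gz if only the compressed file exists
```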

# create the dPUC context count network for your Pfam version 
# (skip this step by downloading above the precomputed dPUC network you need) 
perl -w dpucNet.pl Pfam-A.full.uniprot dpucNet.txt 

# produce domain predictions with HMMER3 (provides hmmscan) with weak filters. 
# this is the slowest step 
perl -w 0runHmmscan.pl hmmscan Pfam-A.hmm sample.fa sample.txt 
# (optional) compress output, our scripts will read it either way 
gzip sample.txt 

# produce dPUC predictions with default parameters 
perl -w 1dpuc2.pl Pfam-A.hmm.dat dpucNet.txt sample.txt sample.dpuc.txt 

# produce dPUC predictions for more than one p-value threshold for candidate domains 
# (produces one output for each threshold) 
perl -w 1dpuc2.pl Pfam-A.hmm.dat dpucNet.txt sample.txt sample.dpuc.txt --pvalues 1e-1 1e-4

dpucNet.pl 1.01 - Extract the context count network from Pfam predictions

perl -w dpucNet.pl <Pfam-A.full.uniprot> <dpucNet output>
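The required arguments are, in order, the Pfam-A.full.uniprot file (Pfam-A.full for Pfam 23-28) and the name of the dpucNet output file to create.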

This Pfam-specific script counts every ordered domain family pair observed in the full predictions of standard Pfam releases. The code is heavily optimized to parse such a large file quickly and to store the domain predictions with minimal memory usage. Unfortunately, Pfam-A.full.uniprot lists domains grouped first by family, whereas these family pair counts can only be calculated if predictions are grouped first by protein, so all the data must be in memory before it can be rearranged. The run for Pfam 28 used 10.6 GB of RAM and took 20 minutes.
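
The regrouping problem described above can be illustrated on a toy scale. The sketch below is not the algorithm in dpucNet.pl: it reads an invented three-column table (family, protein, start) instead of the real Pfam format, and it assumes a pair is counted for every two domains in the same protein, with the upstream family listed first. The function name count_ordered_pairs is hypothetical.

```shell
#!/bin/sh
# Count ordered domain family pairs from a toy three-column table read on
# stdin: family, protein, start coordinate (whitespace-separated).
# ASSUMPTIONS: this toy format is invented for illustration (it is not the
# Pfam-A.full.uniprot format), and a pair is counted for every two domains
# in the same protein, with the upstream family listed first.
count_ordered_pairs() {
    # Regroup by protein, then by start, since the input is grouped by family.
    LC_ALL=C sort -k2,2 -k3,3n |
    awk '
        $2 != prot { n = 0; prot = $2 }   # new protein: reset its domain list
        {
            for (i = 1; i <= n; i++)      # pair with every upstream domain
                count[fam[i] " " $1]++
            fam[++n] = $1
        }
        END { for (p in count) print p, count[p] }
    ' | LC_ALL=C sort
}

# Usage: each output line is "upstream_family downstream_family count".
printf 'PF1 protA 10\nPF1 protB 5\nPF2 protA 50\nPF2 protB 1\n' | count_ordered_pairs
```

Here the toy input is sorted on disk by an external tool, whereas the real script, as described above, rearranges the data in memory.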

You will only need to run this command once per Pfam release. Moreover, you can skip this step altogether by downloading the precomputed dPUC context count networks I provide above (which I recommend, since downloading Pfam-A.full.uniprot takes a very long time). This script is mostly provided for completeness and so the community can inspect it, find bugs, and otherwise improve it.

0runHmmscan.pl 1.01 - Get domain predictions from your protein sequences

perl -w 0runHmmscan.pl <hmmscan> <Pfam-A.hmm> <FASTA input> <output table>
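The required arguments are, in order, the hmmscan executable, the Pfam-A.hmm database (already processed with hmmpress), the FASTA file of protein sequences, and the name of the output table to create.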

This script doesn't simply run hmmscan with default parameters; it changes a few options that are important for dPUC to work. In particular, it makes sure outputs report \(p\)-values in place of \(E\)-values, and it relaxes the heuristic \(p\)-value filters to allow more predictions through. This script sets the most stringent \(p\)-value threshold at 1e-4, which matches the domain candidate \(p\)-value threshold that dPUC uses, and which worked best in our benchmarks.
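
For readers curious how such settings can be expressed with hmmscan's own documented options, here is a hedged sketch. An \(E\)-value is a \(p\)-value multiplied by the number of comparisons, \(E = pZ\), so fixing \(Z\) at 1 makes the reported \(E\)-values equal to \(p\)-values. Whether 0runHmmscan.pl passes exactly these options is an assumption on my part, as is the choice of --F3 as the filter that is set to 1e-4; the function name hmmscan_pvalue_cmd is hypothetical.

```shell
#!/bin/sh
# Print (not run) an hmmscan command line that reports p-values.
# ASSUMPTION: this mirrors the idea described above, not necessarily the exact
# options 0runHmmscan.pl passes. -Z 1 and --domZ 1 fix the number of
# comparisons used for E-values at 1, so each reported E-value equals its
# p-value. --F3 is the last heuristic filter (HMMER default 1e-5), relaxed
# here to 1e-4.
hmmscan_pvalue_cmd() {
    # $1 = hmmscan executable, $2 = pressed HMM database,
    # $3 = FASTA input, $4 = domain table output
    printf '%s -Z 1 --domZ 1 --F3 1e-4 --noali -o /dev/null --domtblout %s %s %s\n' \
        "$1" "$4" "$2" "$3"
}

# Usage: show the command that would be run for the sample file.
hmmscan_pvalue_cmd hmmscan Pfam-A.hmm sample.fa sample.txt
```

Printing the command rather than running it keeps the sketch harmless on machines without HMMER. For actual use, prefer 0runHmmscan.pl, which sets the options dPUC expects.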

1dpuc2.pl 1.01 - Produce dPUC domain predictions from raw hmmscan data.

perl -w 1dpuc2.pl [-options] <Pfam-A.hmm.dat> <dpucNet> <input table> <output table>
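The required arguments are, in order, Pfam-A.hmm.dat, the dPUC context count network, the input table produced by 0runHmmscan.pl, and the name of the output table to create.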

Options:

If more than one \(p\)-value threshold on candidate domains is provided, one output is produced per threshold. Note that dPUC predictions are not necessarily nested as candidate domain thresholds are varied, since dPUC solves a complicated pairwise optimization problem; these files are therefore not redundant and cannot be obtained by filtering the largest file. Each output is as if dPUC had been run anew on the given \(p\)-value threshold, although the code does this efficiently by factoring out shared steps between thresholds.

Compatibility

Code was tested with HMMER versions 3.0, 3.1b1, 3.1b2, against the Pfam 25, 27-31 HMM databases. The code worked with Inline::C 0.53 and 0.62, and with lp_solve 5.5.2.0.

Notes shared with other projects

Please read some additional notes by following the next link.

My public code, description of my original Perl packages
The union of code across the latest DomStratStats, dPUC, and RandProt distributions.

Citations

Alejandro Ochoa and Mona Singh. Domain prediction with probabilistic directional context. Submitted. bioRxiv 2016-12-14.

2015-11-17. Alejandro Ochoa, John D Storey, Manuel Llinás, and Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. PLoS Comput Biol, 11:e1004509. arXiv 2014-09-23.

2011-03-31. Alejandro Ochoa, Manuel Llinás, and Mona Singh. Using context to improve protein domain identification. BMC Bioinformatics, 12:90.

Release notes

2017-03-10 - Retested for Pfam 31 (code unchanged)

I reran the examples attached to this webpage using this newest Pfam, confirming that my code works.

2016-12-14 - Preprint released! (code unchanged)

Note that dPUC 2.06 was the version used in this preprint.

2016-08-10 - Retested for Pfam 30 (code unchanged)

I reran the examples attached to this webpage using this newest Pfam, confirming that my code works.

2015-12-23 - dPUC 2.06

Pfam 29 came out yesterday. I reran the examples attached to this webpage using this newest Pfam as well as the newest HMMER (3.1b2). Unfortunately, changes in Pfam broke my code that extracts the "context network" from Pfam-A.full. For Pfam 29, the correct file to use as input is now Pfam-A.full.uniprot, and the parser had to change since some comments it used to rely on are not present in this input file. Lastly, the output is now sorted internally (the first line, the node list, has sorted accessions; all subsequent lines are sorted relative to each other), so that when I change the code I may now verify more easily whether outputs changed or not.

I verified that my new parser in DpucNet.pm generates the same output as the previous parser for Pfam 23-28 (up to sorting). I replaced the old context networks I make available through this site with the sorted versions.

There's one more problem with Pfam 29. In previous Pfam versions, all accessions are present in Pfam-A.full. However, in Pfam 29, 260 families present in Pfam-A.hmm and Pfam-A.hmm.dat are missing from Pfam-A.full and Pfam-A.full.uniprot! My code relies on these two lists agreeing to ensure the same Pfam version was being used. To get around it, I hacked the Pfam 29 context network file I've made available online; it has the accession list from Pfam-A.hmm.dat rather than the list observed in Pfam-A.full.uniprot. Since the hack is simple but evil (because my sanity check shouldn't be avoided so easily on purpose), I have not made code publicly available that implements it. I expect the following Pfam version to not require such a dirty hack.

2015-11-26 - dPUC 2.05

I made some changes to the objective function and constraints, which bring the dPUC model closer to the first-order Markov model of Coin et al. (2003). In the objective, context scores are now half what they used to be (each edge is counted once rather than twice). The objective now has new terms that approximate the sequence threshold. Lastly, explicit domain and sequence score thresholds were removed (although there are equivalent implicit thresholds set by the objective function itself; this will be described in a forthcoming paper). The resulting ILP is simpler and dPUC now runs a bit faster because of these changes, and the quality of predictions was practically unchanged. The default pseudocount is now 1e-3 (it had to get smaller because of the context score halving).

2015-10-31 - dPUC 2.04

The pseudocount, or "alpha" of the Dirichlet prior distribution for context scores, is now parametrized directly, rather than through the log of a "scale" value (of total pseudocounts to total observed counts). The default value for this pseudocount is 0.01, which is very close to the previous default for Pfam 25 (of 0.008, but the exact value varied depending on the Pfam version). I recomputed all the samples using the new version, and found that they were all the same (for Pfam 28). So nothing really changed unless you were using the '--scale' option; in that case you now have to use '--alpha' instead, and the numbers you put in are different.

2015-09-04 - dPUC 2.03

The main script now allows changing a multitude of options (most of which used to be hidden and hardcoded, only the \(p\)-value thresholds were exposed before).

The default scale has been changed from 23 to 3. Since the negative context score is approximately -scale, this means that now dPUC penalizes unobserved domain family pairs much less than it used to, allowing negative context to be overcome if there is sufficient positive context. Our benchmarks found scale=3 to perform best for Pfam 25.

Internally, the source code underwent a large cleaning, removing previously undocumented "experimental" features that didn't improve performance in my latest benchmarks. The code that generates context scores from counts is now particularly transparent.

2015-07-27 - dPUC 2.02

The main script now allows multiple \(p\)-value thresholds for candidate domains passed as an option on the command line. The script used to have a single threshold hardcoded, and did not handle outputs for multiple thresholds.

2015-07-23 - dPUC 2.01

The most important change is the addition of domain family checks to ensure the correct version of Pfam is being used throughout.

The code was tested with Pfam 28, which was released in May 2015. All the samples provided now correspond to that version. In testing, I found a bug in the Pfam-A.hmm.dat file, whose "nesting" associations between domain families were blank, and this caused my old Pfam-A.hmm.dat parser to complain to STDERR a lot. I adjusted my code so this does not happen anymore. Unfortunately, for the current Pfam 28 Pfam-A.hmm.dat, there is no domain family nesting information available for use, though I contacted the Pfam team and this bug may be fixed in the future (and my code will work correctly when that happens).

2014-09-18 - dPUC 2.00

This was the first public release of the second series of the code.

Compared to the 1.0 version, lots has changed. In summary, the 2.xx series works with HMMER3/Pfam 24 (and newer), and it is very highly optimized to reduce runtime. Importantly, the output format has changed (now it is simply like the HMMER3 domain tabular output files).

Compared to the beta version, many packages were reorganized, in general for the better. Also, I include code to generate the dPUC count network (the beta version simply provided the network for Pfam 24, and that network had strange filters that I have removed in this new public version).

2014-02-37 - dPUC 2.00 beta

I prepared this version for Juliana Bernardes, but until I posted it with the final version 2.00, I believe nobody else had seen this version, so I do not consider it a public version.

2011-02-06 - dPUC 1.0

Last change recorded on the public source code. See the dPUC (1.0) website and our paper for more information.
