My public code

description of my original Perl packages

by Alejandro Ochoa García

The union of code across the latest DomStratStats, dPUC, and RandProt distributions.




This document summarizes my latest public code that is part of three related projects, namely DomStratStats 1.04, dPUC 2.06, and RandProt 1.01. Many packages are shared, so here I describe them for all projects at once. See the page of each project for code downloads, release notes and additional information.

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

Running scripts from arbitrary directories

The example below shows how to tell Perl where my packages are located. This approach has been verified to work for all my scripts.

# run script from the directory that contains it 
cd /myCode/ 
perl -w <ARGS>... 

# run script from other directories 
cd /otherDir/ 
perl -I/myCode/ -w /myCode/ <ARGS>...
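The -I flag above adds a directory to Perl's module search path (@INC). As a minimal sketch of how the same effect can be obtained from inside a script, the core FindBin and lib modules can be combined as shown here. This is only an illustration of the standard idiom; it is an assumption that one would want it, and my distributed scripts are not claimed to do this.

```perl
use strict;
use warnings;
use FindBin qw($Bin);   # $Bin holds the directory that contains the running script
use lib $Bin;           # prepend that directory to @INC, much like -I on the command line

# Report whether the script's own directory is now on the module search path.
my $found = grep { $_ eq $Bin } @INC;
print "script directory: $Bin\n";
print "on \@INC: ", ($found ? "yes" : "no"), "\n";
```

With these two lines at the top of a script, packages stored next to it are found regardless of the current working directory.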

Description of original Perl packages

Since my source code is not self-documented, here's a brief description of what each package does.

| File | Latest version | Description | Projects that share it |
|---|---|---|---|
| | 1.01 | Handles normal and compressed files transparently. | DomStratStats, dPUC, RandProt |
| | 1.00 | Light-weight tools to handle FASTA files. | DomStratStats, RandProt |
| | 1.01 | Parses Pfam-A.hmm.dat and other Pfam-specific files. | DomStratStats, dPUC |
| | 1.00 | Processes domains, particularly overlaps. | DomStratStats, dPUC |
| | 1.03 | Runs and parses hmmscan outputs into my domain structure, one protein at a time to reduce the memory footprint, also produces outputs (intended for domain filtering only) and adds custom columns. | DomStratStats, dPUC |
| | 1.01 | General-purpose tools for computing and analyzing \(p\)-values to get \(q\)-values, particularly for censored \(p\)-values. | DomStratStats |
| | 1.02 | General-purpose tools for computing and analyzing \(p\)-values to get lFDRs (local False Discovery Rates), particularly for censored \(p\)-values. | DomStratStats |
| | 1.03 | Applies \(q\)-values and lFDRs to domain prediction from HMMER3. | DomStratStats |
| | 1.00 | Functions to normalize protein sequences, and to count \(k\)-mers efficiently. | RandProt |
| | 1.01 | Functions that generate random sequences from the \(k\)-mer data (and protein lengths). The most challenging part to develop was drawing the initial \(k\)-mer of a sequence, which entails drawing from an extremely large categorical distribution with non-uniform probability parameters. I implemented a binary-search-based method that computes each draw in \(O(\log(m))\), where \(m \approx 20^k\) is the number of categories. I have recently been made aware of even faster methods, which I may implement in the future. Regardless, this is much faster than a naive implementation. | RandProt |
| | 2.06 | Main dPUC package that connects the different strategies for predicting domains using dPUC context scores. | dPUC |
| | 2.00 | Portion of the code in C that solves the most numerically intensive part of the "positive elimination" of dPUC. | dPUC |
| | 1.01 | Tells Perl where to find the lpsolve55 C library, constructs an "lp" object from the Perl data to solve with lp_solve, and returns the result to Perl. | dPUC |
| | 2.02 | Extracts a directed network of domain family pair counts as observed in a Pfam-A.full file. | dPUC |
| | 1.01 | Turns the dPUC context count network into a bitscore network for domain prediction. | dPUC |
| | 1.00 | Compacts domain overlap definitions, by finding cliques, to make lp_solve more efficient. | dPUC |
| | 1.00 | Finds connected components in a network. | dPUC |
| | 1.00 | Maps non-negative integer pairs into single integers, for the ordered and unordered pair cases. | dPUC |
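To illustrate the binary-search draw mentioned for the random-sequence package, here is a minimal sketch of drawing from a categorical distribution in \(O(\log(m))\) per draw, after a one-time pass that builds cumulative sums. The function names (cumulative_sums, search_index, draw_category) and the toy counts are hypothetical and chosen for this example only; this is not the RandProt source.

```perl
use strict;
use warnings;

# Build the running totals of a list of non-negative weights (e.g. k-mer counts).
sub cumulative_sums {
    my ($weights) = @_;
    my @cum;
    my $total = 0;
    for my $w (@$weights) {
        $total += $w;
        push @cum, $total;
    }
    return \@cum;
}

# Binary search for the smallest index i with $cum->[i] > $u.
# Assumes 0 <= $u < $cum->[-1], so such an index always exists.
sub search_index {
    my ($cum, $u) = @_;
    my ($lo, $hi) = (0, $#$cum);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($cum->[$mid] > $u) { $hi = $mid }
        else                   { $lo = $mid + 1 }
    }
    return $lo;
}

# Draw one category index with probability proportional to its weight.
sub draw_category {
    my ($cum) = @_;
    return search_index($cum, rand($cum->[-1]));
}

# Toy example: counts for four made-up 2-mers.
my @kmers = qw(AA AC CA CC);
my $cum   = cumulative_sums([5, 1, 3, 1]);
my %seen;
$seen{ $kmers[ draw_category($cum) ] }++ for 1 .. 10_000;
printf "%s\t%d\n", $_, ($seen{$_} || 0) for @kmers;
```

Each draw touches about \(\log_2(m)\) entries of the cumulative array, whereas a linear scan would touch up to \(m\) of them. Categories with zero weight are never selected, because the search requires a cumulative sum strictly greater than the random value.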

Description of original Perl scripts

Only one script is shared, but I list everything here for completeness and also to document versions.

| File | Latest version | Description | Projects that share it |
|---|---|---|---|
| | 1.01 | Get domain predictions from your protein sequences. | DomStratStats, dPUC |
| | 1.02 | Removes overlapping domains ranking by \(p\)-value. | DomStratStats |
| | 1.00 | Computes and adds domain \(q\)-values, lFDRs, and FDR\|lFDR. | DomStratStats |
| | 1.00 | Computes and adds \(q\)-values for sequences and domains. | DomStratStats |
| | 1.00 | Get final domain predictions from multiple sequence files. | DomStratStats |
| | 1.01 | Extract the context count network from Pfam predictions. | dPUC |
| | 1.01 | Produce dPUC domain predictions from raw hmmscan data. | dPUC |
| | 1.01 | Compute a weak upper bound on \(k\) for \(k\)-mer analysis. | RandProt |
| | 1.01 | Compute percentage of \(k\)-mers observed in a proteome. | RandProt |
| | 1.01 | Make random protein sequences from a high-order Markov model. | RandProt |
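As a rough illustration of what "percentage of \(k\)-mers observed" means, the sketch below counts the distinct \(k\)-mers in a set of sequences and divides by the \(20^k\) possible \(k\)-mers over the standard amino acid alphabet. The function names and the toy sequences are hypothetical, and skipping \(k\)-mers that contain non-standard letters is an assumption of this example; the RandProt scripts may normalize sequences differently.

```perl
use strict;
use warnings;

# Count distinct k-mers that consist only of the 20 standard amino acids.
sub distinct_kmers {
    my ($k, @seqs) = @_;
    my %seen;
    for my $seq (@seqs) {
        my $s = uc $seq;
        # The range is empty when the sequence is shorter than k.
        for my $i (0 .. length($s) - $k) {
            my $kmer = substr($s, $i, $k);
            next unless $kmer =~ /^[ACDEFGHIKLMNPQRSTVWY]+$/;
            $seen{$kmer} = 1;
        }
    }
    return scalar keys %seen;
}

# Percentage of the 20^k possible k-mers that appear at least once.
sub percent_observed {
    my ($k, @seqs) = @_;
    return 100 * distinct_kmers($k, @seqs) / 20**$k;
}

# Toy input: two made-up sequences, the second ending in a non-standard letter.
my @toy = ('MKVLAAGIV', 'MKVLX');
printf "k=2: %.2f%% of possible 2-mers observed\n", percent_observed(2, @toy);
```

Because \(20^k\) grows quickly, this percentage drops toward zero for large \(k\) in any real proteome, which is why an upper bound on \(k\) is useful before fitting a high-order model.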


This code was tested with Perl 5.18, 5.20, and 5.22, but should work with any version ≥5. It is expected to work on any Linux or macOS system; let me know if it does not.

Want a Windows version?

The code will not work on Windows as is, because it calls the gzip executable and uses Unix pipes. However, if you install the PerlIO::gzip package, or forgo working with compressed files, the code could run with some adjustments; contact me for more information.
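For readers curious what pipe-free handling of compressed files can look like, here is a minimal sketch that reads plain or gzip-compressed files through one helper. Note the swap: it uses the core IO::Compress::Gzip and IO::Uncompress::Gunzip modules (bundled with Perl 5.10 and later) rather than PerlIO::gzip, and the helper name open_for_reading is hypothetical. It is not the adjustment I would necessarily make to my packages.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);
use File::Temp qw(tempdir);

# Hypothetical helper: return a read handle for a plain or .gz file,
# without calling the gzip executable or opening a pipe.
sub open_for_reading {
    my ($file) = @_;
    if ($file =~ /\.gz$/) {
        my $z = IO::Uncompress::Gunzip->new($file)
            or die "Could not read $file: $GunzipError";
        return $z;
    }
    open(my $fh, '<', $file) or die "Could not open $file: $!";
    return $fh;
}

# Round trip: write a small compressed FASTA file, then read it back.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/toy.fa.gz";
my $out  = IO::Compress::Gzip->new($path)
    or die "Could not write $path: $GzipError";
$out->print(">seq1\nMKVLAAGIV\n");
$out->close;

my $in = open_for_reading($path);
my @lines;
while (defined(my $line = <$in>)) {
    push @lines, $line;
}
close $in;
print @lines;
```

The calling code reads lines the same way for both file types, so the choice of decompression method stays inside the helper.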