My public code

description of my original Perl packages

by Alejandro Ochoa García

The union of code across the latest DomStratStats, dPUC, and RandProt distributions.


Overview

This document summarizes my latest public code that is part of three related projects, namely DomStratStats 1.04, dPUC 2.06, and RandProt 1.01. Many packages are shared, so here I describe them for all projects at once. See the page of each project for code downloads, release notes and additional information.

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

Running scripts from arbitrary directories

The following example shows how to tell Perl where my packages are located. This approach has been verified to work for all my scripts.

# run script from the directory that contains it 
cd /myCode/ 
perl -w myScript.pl <ARGS>... 

# run script from other directories 
cd /otherDir/ 
perl -I/myCode/ -w /myCode/myScript.pl <ARGS>...
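
An alternative to passing -I on the command line, shown here only as a sketch and not something the distributed scripts are known to do, is for a script to add its own directory to @INC with the core FindBin and lib modules. The commented-out package load is illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use FindBin qw($Bin);   # $Bin = directory that contains this script
use lib $Bin;           # prepend that directory to @INC at compile time

# With the script's directory on @INC, packages that sit next to the script
# can be loaded no matter which directory the script is launched from, e.g.:
# use FileGz;   # illustrative; requires FileGz.pm beside this script

# Report whether the script's directory is searchable
my $found = grep { $_ eq $Bin } @INC;
print "Script directory: $Bin\n";
print "On \@INC: ", ($found ? "yes" : "no"), "\n";
```

The trade-off is that the script itself must be edited, whereas the -I flag works with the scripts exactly as distributed.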

Description of original Perl packages

Since my source code is not self-documented, here's a brief description of what each package does.

| File | Latest version | Description | Projects that share it |
|---|---|---|---|
| FileGz.pm | 1.01 | Handles normal and compressed files transparently. | DomStratStats, dPUC, RandProt |
| ParseFasta.pm | 1.00 | Light-weight tools to handle FASTA files. | DomStratStats, RandProt |
| ParsePfam.pm | 1.01 | Parses Pfam-A.hmm.dat and other Pfam-specific files. | DomStratStats, dPUC |
| Domains.pm | 1.00 | Processes domains, particularly overlaps. | DomStratStats, dPUC |
| Hmmer3ScanTab.pm | 1.03 | Runs hmmscan and parses its output into my domain structure, one protein at a time to reduce the memory footprint; also produces outputs (intended for domain filtering only) and adds custom columns. | DomStratStats, dPUC |
| Qvalue.pm | 1.01 | General-purpose tools for computing and analyzing \(p\)-values to get \(q\)-values, particularly for censored \(p\)-values. | DomStratStats |
| QvalueLocal.pm | 1.02 | General-purpose tools for computing and analyzing \(p\)-values to get lFDRs (local False Discovery Rates), particularly for censored \(p\)-values. | DomStratStats |
| DomStratStats.pm | 1.03 | Applies \(q\)-values and lFDRs to domain prediction from HMMER3. | DomStratStats |
| ProtKmer.pm | 1.00 | Functions to normalize protein sequences, and to count \(k\)-mers efficiently. | RandProt |
| ProtMarkov.pm | 1.01 | Functions that generate random sequences from the \(k\)-mer data (and protein lengths). The most challenging part to develop was drawing the initial \(k\)-mer of a sequence, which entails drawing from an extremely large categorical distribution with non-uniform probability parameters. I implemented a binary-search-based method that computes each draw in \(O(\log(m))\) time, where \(m \approx 20^k\) is the number of categories. I have recently been made aware of even faster methods, which I may implement in the future. Regardless, this is much faster than a naive implementation. | RandProt |
| Dpuc.pm | 2.06 | Main dPUC package that connects the different strategies for predicting domains using dPUC context scores. | dPUC |
| DpucPosElim.pm | 2.00 | Portion of the code in C that solves the most numerically-intense part of the "positive elimination" of dPUC. | dPUC |
| DpucLpSolve.pm | 1.01 | Tells Perl where to find the lpsolve55 C library, constructs an "lp" object from the Perl data to solve with lp_solve, and returns the result to Perl. | dPUC |
| DpucNet.pm | 2.02 | Extracts a directed network of domain family pair counts as observed in a Pfam-A.full file. | dPUC |
| DpucNetScores.pm | 1.01 | Turns the dPUC context count network into a bitscore network for domain prediction. | dPUC |
| DpucOvsCompact.pm | 1.00 | Compacts domain overlap definitions, by finding cliques, to make lp_solve more efficient. | dPUC |
| NetCC.pm | 1.00 | Finds connected components in a network. | dPUC |
| EncodeIntPair.pm | 1.00 | Maps non-negative integer pairs into single integers, for the ordered and unordered pair cases. | dPUC |
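
The binary-search draw described for ProtMarkov.pm can be sketched as follows. This is a minimal illustration of the general technique (cumulative sums plus a binary search), not the code in ProtMarkov.pm; the function names `build_cumulative` and `draw_index` and the toy counts are hypothetical.

```perl
use strict;
use warnings;

# Build the array of cumulative sums for a list of non-negative weights
# (for example, k-mer counts). Done once, in O(m) time.
sub build_cumulative {
    my ($weights) = @_;
    my @cum;
    my $total = 0;
    for my $w (@$weights) {
        $total += $w;
        push @cum, $total;
    }
    return \@cum;
}

# Draw one category index with probability proportional to its weight.
# Binary search finds the first index whose cumulative sum exceeds $u,
# which takes O(log m) comparisons.
# $u may be passed in for reproducibility; by default it is uniform in [0, total).
sub draw_index {
    my ($cum, $u) = @_;
    $u = rand($cum->[-1]) unless defined $u;
    my ($lo, $hi) = (0, $#$cum);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($cum->[$mid] > $u) { $hi = $mid; }
        else                   { $lo = $mid + 1; }
    }
    return $lo;
}

# Toy example: four categories with counts 1, 0, 3, 6
my $cum = build_cumulative([1, 0, 3, 6]);   # cumulative sums: 1, 1, 4, 10
my @hits = (0) x 4;
$hits[ draw_index($cum) ]++ for 1 .. 10_000;
print join("\t", @hits), "\n";   # the zero-weight category is never drawn
```

A naive draw would scan the cumulative sums linearly, costing \(O(m)\) per draw; with \(m \approx 20^k\) categories, the binary search needs only about \(k \log_2 20\) comparisons.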

Description of original Perl scripts

Only one script is shared, but I list everything here for completeness and also to document versions.

| File | Latest version | Description | Projects that share it |
|---|---|---|---|
| 0runHmmscan.pl | 1.01 | Get domain predictions from your protein sequences. | DomStratStats, dPUC |
| 1noOvs.pl | 1.02 | Removes overlapping domains ranking by \(p\)-value. | DomStratStats |
| 2domStratStats.pl | 1.00 | Computes and adds domain \(q\)-values, lFDRs, and FDR\|lFDR. | DomStratStats |
| 3tieredStratQ.pl | 1.00 | Computes and adds \(q\)-values for sequences and domains. | DomStratStats |
| 4allManyOrgs.pl | 1.00 | Get final domain predictions from multiple sequence files. | DomStratStats |
| dpucNet.pl | 1.01 | Extract the context count network from Pfam predictions. | dPUC |
| 1dpuc2.pl | 1.01 | Produce dPUC domain predictions from raw hmmscan data. | dPUC |
| kMax.pl | 1.01 | Compute a weak upper bound on \(k\) for \(k\)-mer analysis. | RandProt |
| kCov.pl | 1.01 | Compute percentage of \(k\)-mers observed in a proteome. | RandProt |
| randProt.pl | 1.01 | Make random protein sequences from a high-order Markov model. | RandProt |
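
To make the kCov.pl description concrete, the percentage of \(k\)-mers observed can be sketched as below. This only illustrates the idea, assuming the 20 standard amino acids and skipping any \(k\)-mer that contains another character; the function name `kmer_coverage` and the toy sequences are hypothetical, and kCov.pl may normalize sequences differently.

```perl
use strict;
use warnings;

# Percentage of the 20^k possible k-mers observed at least once in a set of
# protein sequences. k-mers with non-standard letters (X, B, Z, ...) are skipped.
sub kmer_coverage {
    my ($k, @seqs) = @_;
    my %seen;
    for my $seq (@seqs) {
        my $s = uc $seq;
        # Slide a window of width k along the sequence
        # (the range is empty when the sequence is shorter than k)
        for my $i (0 .. length($s) - $k) {
            my $kmer = substr($s, $i, $k);
            next unless $kmer =~ /^[ACDEFGHIKLMNPQRSTVWY]+$/;
            $seen{$kmer} = 1;
        }
    }
    return 100 * scalar(keys %seen) / 20**$k;
}

# Toy "proteome" of two short sequences
my @toy = ('MKVLAA', 'AAXKV');
printf "%.2f%% of 2-mers observed\n", kmer_coverage(2, @toy);
```

Because the number of possible \(k\)-mers grows as \(20^k\), coverage drops quickly as \(k\) increases, which is why an upper bound on \(k\) matters before fitting a high-order Markov model.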

Compatibility

This code was tested with Perl 5.18, 5.20, and 5.22, but should work with any version ≥5. It is expected to work on any Linux or macOS system; let me know if it does not.

Want a Windows version?

The code will not work on Windows machines because it calls the gzip executable and uses Unix pipes. However, if you install the PerlIO::gzip package, or forgo working with compressed files, the code could run with some adjustments (contact me for more information).
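
The difference between reading compressed files through an external gzip pipe and decompressing inside Perl can be sketched as follows. This is not the code in FileGz.pm: the function name `open_maybe_gz` is hypothetical, and for the in-Perl branch the sketch swaps in the core module IO::Uncompress::Gunzip rather than PerlIO::gzip, purely so that the example runs without extra installs.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Return a read handle for a plain or gzip-compressed file.
# With $use_pipe true, decompress through the external gzip executable and a
# pipe (Unix-style); otherwise decompress inside Perl, with no external program.
sub open_maybe_gz {
    my ($file, $use_pipe) = @_;
    my $fh;
    if ($file !~ /\.gz$/) {
        open($fh, '<', $file) or die "Cannot open $file: $!";
    } elsif ($use_pipe) {
        open($fh, '-|', 'gzip', '-dc', $file) or die "Cannot run gzip on $file: $!";
    } else {
        $fh = IO::Uncompress::Gunzip->new($file)
            or die "Cannot gunzip $file: $GunzipError";
    }
    return $fh;
}

# Demo: write a small compressed FASTA file, then read it back inside Perl
my ($tmp_fh, $file) = tempfile(SUFFIX => '.fa.gz', UNLINK => 1);
close $tmp_fh;
my $text = ">seq1\nMKVLAA\n";
gzip(\$text => $file) or die "gzip failed: $GzipError";

my $fh = open_maybe_gz($file);
my @lines;
while (my $line = <$fh>) {
    push @lines, $line;
}
close $fh;
print @lines;
```

Only the pipe branch depends on the gzip executable; the other two branches use nothing outside Perl itself.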


History