# My public code

## description of my original Perl packages

### by Alejandro Ochoa García

The union of code across the latest DomStratStats, dPUC, and RandProt distributions.

---

## Overview

This document summarizes my latest public code from three related projects: DomStratStats 1.04, dPUC 2.06, and RandProt 1.01. Many packages are shared between them, so I describe them here for all projects at once. See each project's page for code downloads, release notes, and additional information.

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

## Running scripts from arbitrary directories

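To run a script from a directory other than the one that contains it, tell Perl where my packages are with the `-I` option, as in the following example. This way of specifying the location of my packages has been verified to work for all my scripts.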

    # run script from the directory that contains it
    cd /myCode/
    perl -w myScript.pl <ARGS>...

    # run script from other directories
    cd /otherDir/
    perl -I/myCode/ -w /myCode/myScript.pl <ARGS>...

## Description of original Perl packages

Since my source code is not self-documented, here's a brief description of what each package does.

| File | Latest version | Description | Projects that share it |
| --- | --- | --- | --- |
| FileGz.pm | 1.01 | Handles normal and compressed files transparently. | DomStratStats, dPUC, RandProt |
| ParseFasta.pm | 1.00 | Light-weight tools to handle FASTA files. | DomStratStats, RandProt |
| ParsePfam.pm | 1.01 | Parses Pfam-A.hmm.dat and other Pfam-specific files. | DomStratStats, dPUC |
| Domains.pm | 1.00 | Processes domains, particularly overlaps. | DomStratStats, dPUC |
| Hmmer3ScanTab.pm | 1.03 | Runs hmmscan and parses its outputs into my domain structure, one protein at a time to reduce the memory footprint. Also produces outputs (intended for domain filtering only) and adds custom columns. | DomStratStats, dPUC |
| Qvalue.pm | 1.01 | General-purpose tools for computing and analyzing $$p$$-values to get $$q$$-values, particularly for censored $$p$$-values. | DomStratStats |
| QvalueLocal.pm | 1.02 | General-purpose tools for computing and analyzing $$p$$-values to get lFDRs (local False Discovery Rates), particularly for censored $$p$$-values. | DomStratStats |
| DomStratStats.pm | 1.03 | Applies $$q$$-values and lFDRs to domain prediction from HMMER3. | DomStratStats |
| ProtKmer.pm | 1.00 | Functions to normalize protein sequences, and to count $$k$$-mers efficiently. | RandProt |
| ProtMarkov.pm | 1.01 | Functions that generate random sequences from the $$k$$-mer data (and protein lengths). The most challenging part to develop was drawing the initial $$k$$-mer of a sequence, which entails drawing from an extremely large categorical distribution with non-uniform probability parameters. I implemented a binary-search-based method that computes each draw in $$O(\log(m))$$, where $$m \approx 20^k$$ is the number of categories. I have recently been made aware of even faster methods, which I may implement in the future. Regardless, this is much faster than a naive implementation. | RandProt |
| Dpuc.pm | 2.06 | Main dPUC package that connects the different strategies for predicting domains using dPUC context scores. | dPUC |
| DpucPosElim.pm | 2.00 | Portion of the code in C that solves the most numerically-intense part of the "positive elimination" of dPUC. | dPUC |
| DpucLpSolve.pm | 1.01 | Tells Perl where to find the lpsolve55 C library, constructs an "lp" object from the Perl data to solve with lp_solve, and returns the result to Perl. | dPUC |
| DpucNet.pm | 2.02 | Extracts a directed network of domain family pair counts as observed in a Pfam-A.full file. | dPUC |
| DpucNetScores.pm | 1.01 | Turns the dPUC context count network into a bitscore network for domain prediction. | dPUC |
| DpucOvsCompact.pm | 1.00 | Compacts domain overlap definitions, by finding cliques, to make lp_solve more efficient. | dPUC |
| NetCC.pm | 1.00 | Finds connected components in a network. | dPUC |
| EncodeIntPair.pm | 1.00 | Maps non-negative integer pairs into single integers, for the ordered and unordered pair cases. | dPUC |
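To illustrate the binary-search idea mentioned for ProtMarkov.pm, here is a minimal sketch of drawing from a categorical distribution in $$O(\log(m))$$ per draw. It is not the code in ProtMarkov.pm: the function names (`cumulative_sums`, `find_category`, `draw_category`) and the toy counts are hypothetical, and it assumes the weights are non-negative with a positive total. The cumulative sums are computed once, and each draw then binary-searches them.

```perl
use strict;
use warnings;

# Build the cumulative sums of a list of non-negative weights.
# The weights need not be normalized (raw counts work).
sub cumulative_sums {
    my ($weights) = @_;
    my @cum;
    my $total = 0;
    for my $w (@$weights) {
        $total += $w;
        push @cum, $total;
    }
    return \@cum;
}

# Return the smallest index i such that $cum->[i] > $u.
# Assumes 0 <= $u < $cum->[-1]; uses O(log(m)) comparisons.
sub find_category {
    my ($cum, $u) = @_;
    my ($lo, $hi) = (0, $#$cum);
    while ($lo < $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        if ($cum->[$mid] > $u) {
            $hi = $mid;       # answer is at $mid or to its left
        } else {
            $lo = $mid + 1;   # answer is strictly to the right
        }
    }
    return $lo;
}

# Draw one category index with probability proportional to its weight.
sub draw_category {
    my ($cum) = @_;
    my $u = rand( $cum->[-1] );   # uniform in [0, total)
    return find_category($cum, $u);
}

# Hypothetical toy counts for four categories (illustration only).
my @counts = (5, 0, 3, 2);
my $cum = cumulative_sums(\@counts);
my $i = draw_category($cum);
print "drew category $i\n";
```

Because the comparison is strict (`>`), a category with zero weight is never returned: its cumulative sum equals that of the previous category, so no value of `$u` can select it.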

## Description of original Perl scripts

Only one script is shared, but I list everything here for completeness and also to document versions.

| File | Latest version | Description | Projects that share it |
| --- | --- | --- | --- |
| 0runHmmscan.pl | 1.01 | Get domain predictions from your protein sequences. | DomStratStats, dPUC |
| 1noOvs.pl | 1.02 | Removes overlapping domains ranking by $$p$$-value. | DomStratStats |
| 2domStratStats.pl | 1.00 | Computes and adds domain $$q$$-values, lFDRs, and FDR\|lFDR. | DomStratStats |
| 3tieredStratQ.pl | 1.00 | Computes and adds $$q$$-values for sequences and domains. | DomStratStats |
| 4allManyOrgs.pl | 1.00 | Get final domain predictions from multiple sequence files. | DomStratStats |
| dpucNet.pl | 1.01 | Extract the context count network from Pfam predictions. | dPUC |
| 1dpuc2.pl | 1.01 | Produce dPUC domain predictions from raw hmmscan data. | dPUC |
| kMax.pl | 1.01 | Compute a weak upper bound on $$k$$ for $$k$$-mer analysis. | RandProt |
| kCov.pl | 1.01 | Compute percentage of $$k$$-mers observed in a proteome. | RandProt |
| randProt.pl | 1.01 | Make random protein sequences from a high-order Markov model. | RandProt |

## Compatibility

This code was tested with Perl 5.18, 5.20, and 5.22, but should work with any Perl 5 version. The code is expected to work on any Linux or macOS system; please let me know if it does not.

### Want a Windows version?

The code will not work on Windows machines as distributed, because it relies on the gzip executable and on Unix pipes. However, if you install the PerlIO::gzip package, or forgo working with compressed files, the code could run with some adjustments (contact me for more information).
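For readers unfamiliar with this kind of dependency, the sketch below shows one common way a Perl script can read a compressed file through a pipe from the gzip executable. It is an illustration only, not the interface of FileGz.pm: the helper name `open_maybe_gz` is hypothetical, and the sketch assumes that `gzip` is on the `PATH` and that compressed files are recognized by a `.gz` extension.

```perl
use strict;
use warnings;

# Open a file for reading, decompressing on the fly when its name ends in ".gz".
# Hypothetical helper for illustration; not the FileGz.pm interface.
sub open_maybe_gz {
    my ($file) = @_;
    my $fh;
    if ($file =~ /\.gz$/) {
        # List-form pipe open: runs "gzip -dc FILE" and reads its standard output.
        # This step needs both the gzip executable and pipe support.
        open($fh, '-|', 'gzip', '-dc', $file)
            or die "Could not run gzip on $file: $!";
    } else {
        # Plain files are opened directly.
        open($fh, '<', $file)
            or die "Could not open $file: $!";
    }
    return $fh;
}

# Example usage: print the contents of the file given on the command line, if any.
if (@ARGV) {
    my $fh = open_maybe_gz($ARGV[0]);
    print while <$fh>;
    close $fh;
}
```

The calling code reads from the returned file handle in the same way in both cases, which is what makes the handling of compressed files transparent to the rest of a script.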
