generate high-order Markov random protein sequences

by Alejandro Ochoa García, John D Storey, Manuel Llinás, Mona Singh

Fast implementation that captures subtleties of protein sequences


en-us es-mx - - Email Me

Here you can download our code, and learn how to use it. This page is the software's manual.

About RandProt

The goal of this software is to efficiently generate random protein sequences from a high-order Markov model. Roughtly-speaking, a \(k-1\)-order Markov model preserves the \(k\)-mer distribution of the original data, which is how the model is encoded internally. The code emphasizes simplicity and computational efficiency. Note, however, that most of the code is specific to protein sequences, it may not be reused for generating other sequences without significant changes.

The formal specification of the random sequences follows. The input is a collection of protein sequences. The output will have, for every input sequence, \(n\) random sequences with the same length (with the same IDs plus the suffix.RAND\(i\) where \(i\) goes from \(0\) to \(n-1\) with padded zeroes). The \(k\)-mer distribution is obtained from the input data excluding initial methionine (M) residues, when present. The random sequences always begin with M, followed by a random \(k\)-mer drawn from the global distribution, followed by individual amino acids drawn from the Markov model (that is, conditional on the previous \(k-1\) amino acids). The sequence is terminated when the desired length is attained.

There's more details of lesser importance. The input is always "normalized" so that only the 20 standard amino acids are considered. To achieve this, selenocysteines (U) are replaced by cysteines (C), and pyrrolysines (O) by lysines (K). Additionally, \(k\)-mers with ambiguous amino acid codes (B,J,X,Z) are always ignored.

Source code

/dl.png RandProt-1.00.tgz (23 KB)

Download our Perl source code, in a gzip-compressed tar archive file. It is very small, 4 packages and 3 scripts totalling 434 lines of code (according to cloc).

All my code is released under the GNU GPLv3 (GNU General Public License version 3).

This project is now on GitHub, where you can contribute and improve this software.


Our perl code is self-contained, so only Perl 5 is required (no external perl packages are needed). All you have to do is unpack the code and run the scripts from the directory that contains them. In this mode, the scripts and all packages should stay together in the same directory. You can also run my scripts from arbitrary directories.

Our code assumes gzip is available in your system.

Running the RandProt scripts


Besides the input sequence file, the main analysis requires positive integer parameters \(k\) and \(n\). The value of \(n\) provides the number of random samples per real sequence. In theory, the larger the better, but the complexity of downsteam analyses may set upper limits on \(n\). I've personally used \(n=100\) and \(n=20\) in my domain prediction FDR benchmarks, but for your needs \(n=1\) may be perfectly acceptable.

The value of \(k\) controls how much the random sequences will look like real sequences, with small values of \(k\) yielding less realistic random proteins. However, a value of \(k\) that is too large can be bad, as it may recapitulate properties of your real proteins that you do not wish to have in your random sequences (for example, in my unpublished research I found that some protein domains can be well predicted by the presence of \(5\)-mers that represent highly-conserved motifs, so even \(k=5\) preserves some biologically-relevant signal that perhaps should be excluded from random sequences). In my work I favored \(k=3\).

Sample sequence file

To run through these examples, any FASTA format protein sequence file can be used. If you want, you can try this file, PLAF7.fa.gz, which is the Plasmodium falciparum proteome from PlasmoDB version 11.0, with pseudogenes removed, it contains 5441 proteins and is 2.3 MB in size (compressed).

Synopsis of scripts

The following commands can be run on the sample file provided above placed in the same directory as the code and called from that location. The input file may be compressed with gzip and may be specified with or without the GZ extension. The output file is compressed with gzip, whether the outputs indicate a GZ extension or not. So the commands as they are below produce and will work entirely with compressed files, without a single GZ extension indicated.

# get a quick estimate of the maximum k to consider 
perl -w kMax.pl PLAF7.fa 

# try a few values of k until a decent coverage is achieved 
perl -w kCov.pl PLAF7.fa 5 
perl -w kCov.pl PLAF7.fa 4 
perl -w kCov.pl PLAF7.fa 3 

# generate your desired random proteome! 
perl -w randProt.pl PLAF7.fa PLAF7.k3.n100.fa 3 100

kMax.pl 1.00 - Compute a weak upper bound on \(k\) for \(k\)-mer analysis

perl -w kMax.pl <input FASTA>
The required inputs are

Input file may be compressed with gzip, and may be specified with or without the GZ extension. Analysis is sent to the standard output.

The reasoning behind this script follows. The maximum recommended value of \(k\) depends in part on the size of your input sequences. This is because very small sets of sequences may not contain enough samples of all \(k\)-mers if \(k\) is too large. A sequence of length \(l\) has at most \(l-k+1\) unique \(k\)-mers, while there are \(20^k\) unique \(k\)-mers overall. Therefore, in order to ensure that all \(k\)-mers may appear in the input, now for a proteome with \(l\) amino acids (assuming \(k\) is much smaller than most protein lengths) \(k\) should satisfy the following weak upper bound, $$ k \le k_{max} = \left \lfloor \frac{\log(l)}{\log(20)} \right \rfloor, $$ but usually \(k\) must be even smaller than this upper bound suggests, to ensure that the space is sampled adequately. For the input FASTA file, this script ouputs the values of \(l\) and \(k_{max}\), just as in the example below.

Input: PLAF7.fa.gz
Number of letters: 4139904
kMax: 5
This is intended to be a very quick rough analysis. For a more precise assessment on \(k\)-mer coverage for a given dataset, look at the next script.

kCov.pl 1.00 - Compute percentage of \(k\)-mers observed in a proteome

perl -w kCov.pl <input FASTA> <k>
The required inputs are

Input file may be compressed with gzip, and may be specified with or without the GZ extension. Analysis is sent to the standard output.

This script simply obtains the \(k\)-mer distribution from the input file, and compares the number of unique \(k\)-mers observed to the theoretical number of possible \(k\)-mers. For the sample file, the outputs for \(k=5,4,3\) (run separately) are shown below, which shows that \(k=3\) should be the maximum value of \(k\) to consider for this data.

k: 5
Number of theoretical amino acid k-mers: 3200000
Scanning proteome...
Number of observed amino acid k-mers: 1226974
Proportion of k-mers observed: 38.34 %

k: 4
Number of theoretical amino acid k-mers: 160000
Scanning proteome...
Number of observed amino acid k-mers: 148284
Proportion of k-mers observed: 92.68 %

k: 3
Number of theoretical amino acid k-mers: 8000
Scanning proteome...
Number of observed amino acid k-mers: 7999
Proportion of k-mers observed: 99.99 %
I highly recommend choosing a \(k\) for which most \(k\)-mers were observed, such as \(k=3\) in the case above, but that is merely an upper bound, and you may want to choose a smaller \(k\) using other criteria.

This is the more refined way of choosing an upper bound for \(k\), it is more accurate than the output of kMax.pl. However, kMax.pl will be much faster on very large inputs, so it is recommended to run it first to get a rough idea of the range of \(k\) to test.

randProt.pl 1.00 - Make random protein sequences from a high-order Markov model

perl -w randProt.pl <input FASTA> <output FASTA> <k> <n>
The required inputs are

Input file may be compressed with gzip, and may be specified with or without the GZ extension. Output file will be automatically compressed with gzip.

For the sample file, I provide here one random output using \(k=3\) and \(n=100\), PLAF7.k3.n100.fa.gz (227 MB), so you can see the format, but it won't be identical to any of your ouputs because they are always different. On my machine, it took 24 minutes to generate this random proteome (it is much faster than a naive method, but it's not that fast).

Notes shared with other projects

Please read some additional notes by following the next link.

My public code
My public code, description of my original Perl packages
The union of code across the latest DomStratStats, dPUC, and RandProt distributions.

UniRef50-based random sequences

In our paper cited below, we used the following random protein sequence file that was generated by our software. It is a large file!

/dl.png UNIREF50.k3.fa.gz (764 MB)


Alejandro Ochoa, John Storey, Manuel Llinás, and Mona Singh. Beyond the E-value: stratified statistics for protein domain prediction. Submitted. arXiv 2014-09-23.

Release notes

2014-09-12 - RandProt 1.00

This was the first public release of the code.