Running fivepseq

Fivepseq requires the following files to run:

Aligned reads (.bam)
Alignment index (.bai)
Genomic sequence file (.fasta / .fa)
Genomic annotation file (.gff/ .gtf)

This section assumes that you already have these files. If not, please, refer to the section: Preparing the data.

Fivepseq usage

The fivepseq --help command will show fivepseq usage and will list all the arguments.

usage: fivepseq -b B -g G -a A [optional arguments]

Required arguments

-b B   full path to one or more bam/sam files (many files should be provided with a pattern, **within double quotes**: e.g. ["your_bam_folder/*.bam"])
-g G   full path to the fa/fasta file
-a A   full path to the gtf/gff/gff3 file

Note:

  • The indexed alignment files (.bam.bai) should be in the same directory as the bam files, with the same name and .bai extension.

  • Multiple bam files should be indicated with a pattern placed within double quotes: e.g. [“your_bam_folder/*.bam”]

Commonly, you will run fivepseq by also providing the name of the output folder (‘fivepseq’ by default) and the title of your run (if not provided, it will be determined from bam path):

fivepseq \
   -g <path_to_genome_fasta> \
   -a <path_to_annotation> \
   -b <path_to_bam_file(s) \
   -o <output_directory> \
   -t <title_of_the_run>

Note: this is a single commandline, the backslashes are used to move to a new line for cozy representation: either copy-paste like this or use a single line without the backslashes.

Optional arguments

--span SPAN

This specifies the number of bases to span around translation START and STOP positions. The default value is 100 bases.

--conflicts {add,overwrite,alt_dir}

The conflicts mode specifies how to deal with files/folders that already exist. You may choose either of the following options:

add:

(default). Only adds missing files to an existing directory and skips generating the ones that already exist.

overwrite:

Overwrites all the files in an existing output directory

alt_dir:

Creates an alternative directory by appending ‘+’ suffix to an existing output directory

--ignore-cache

When the annoatation gff/gtf file is read by fivepseq for the first time, it stores the transcript assembly object it generates in a pickle path, which is located in the same parent directory where your fivepseq output folder is, in the folder fivepseq_cache. If the annotation name stays the same, fivepseq will directly load this object in all further runs, instead of processing the annotation file. However, if the content of your annotation file has changed or you suspect that the previous pickle object might be truncated, you can use this option to ignore the cache and process the annotation file from scratch.

Additional arguments

-gf GF, -genefilter GF

If you are interested only in a specific set of transcripts, you may specify them with a text file containing newline-separated list of names you’d like to use. Note: only these genes will be used in all the calculations.

In a properly formatted gene filter file, the first line should specify the gene attribute in the gff file, and the rest of the lines should correspond to the actual values of that attribute in the gff file. For example, for the following two entries in the gff file:

Isgdgene335649.+.ID=gene:YAL069W;biotype=protein_coding;gene_id=YAL069W;logic_name=sgd
IsgdmRNA335649.+.ID=transcript:YAL069W_mRNA;Parent=gene:YAL069W;biotype=protein_coding;transcript_id=YAL069W_mRNA
Isgdexon335649.+.Parent=transcript:YAL069W_mRNA;Name=YAL069W_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL069W_mRNA-E1;rank=1
IsgdCDS335649.+0ID=CDS:YAL069W_mRNA;Parent=transcript:YAL069W_mRNA;protein_id=YAL069W_mRNA

Isgdgene538792.+.ID=gene:YAL068W-A;biotype=protein_coding;gene_id=YAL068W-A;logic_name=sgd
IsgdmRNA538792.+.ID=transcript:YAL068W-A_mRNA;Parent=gene:YAL068W-A;biotype=protein_coding;transcript_id=YAL068W-A_mRNA
Isgdexon538792.+.Parent=transcript:YAL068W-A_mRNA;Name=YAL068W-A_mRNA-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=YAL068W-A_mRNA-E1;rank=1
IsgdCDS538792.+0ID=CDS:YAL068W-A_mRNA;Parent=transcript:YAL068W-A_mRNA;protein_id=YAL068W-A_mRNA

The following gene filter file may be used, where the attribute is gene_id.

gene_id
YAL069W
YAL068W-A

In fact, the gene IDs you see in the file have the form gene:YAL069W, however, fivepseq tolerates if you just give the identifier following the colon.

You may also specify the CDS ID, again omitting (or not) the text before colon:

ID
YAL069W_mRNA
YAL068W-A_mRNA

You may also use other attributes in the CDS entry if you have alternative gff/gtf files. In case you specify the wrong attribute, the error message will tell you what attributes you may use. In case you use the wrong attribute values, the error message will list a few values of that attribute in the correct format.

-gs GS, -genesets GS

If you’d like to compare fivepseq plots between different gene sets, you may provide this as an additional argument. In comparison to the gene filter option above, the gene set option works as follows. Fivepseq first performs regular calculations, including all the genes in the annotation file, and later generates additional reports to compare profiles based on the gene sets provided.

The file should be tab-delimited. With the first column indicating the genes, while the second one - the gene set names. The first line again indicates the attribute name and the geneset heading, while the rest of the lines contain gene - gene set mappings. The following example demonstrates one such file (the attribute choice is explained above):

gene_id   GO:BP
YLR116W   mRNA splicing, via spliceosome
YDL070W   DNA repair
YPR030W   fungal-type cell wall organization
YJL158C   fungal-type cell wall organization
YFL039C   DNA repair

Note: the first line is required.

Advanced arguments

Options for codon-specific counts

The gene boundaries are usually masked by 20 nt (default) when counting the distribution of 5’P counts relative to codons and codon-motifs. You may change the mask size by setting it in the [3:50) range.

-ms [mask size],  -codon-mask-size [mask size]

Or you may turn masking off altogether, by passing in the following option:

--no-mask

After analyzing di/tri-codon/peptide specific counts fivepseq chooses top 50 motifs that have highest pausing at the position which corresponds to ribosome protected fragments when the last codon of the motif is in the A site, which is -14nt for di-peptides(codons) and -11nt for tri-peptides(codons). You may change these positions with the following options:

-dipeptide-pos

Pass a number in the range [-27:6) to this option to tell fivepseq counts in which position {-27:6} from A site should be ordered to output top stalled dipeptides.

-tripeptide-pos

Pass a number in the range [-24:9) to this option to tell fivepseq counts in which position {-24:9} from A site should be ordered to output top stalled tripeptides.

Noise removal

Fivepseq reduces noise by detecting 5’ counts that our outliers in the background count distribution. The latter is well approximated with Poisson distribution, computed based on the count distribution mean. Counts for which the probability of falling into this distribution is less than a certain threshold (0 by default) are considered as outliers. These outliers are usually down-scaled or down-sampled to the most extreme distribution count possible. However, you may modify this by either of the two options below:

--ds DS, --downsample DS

With this option you can omit the distribution-modeled down-sampling described above and specify a constant value instead. Counts exceeding this threshold will be down-scaled to it.

--op OP, --outlier_probability OP

With this option you may change the default probability threshold of Poisson distribution. You may increase it from 0, to be more harsh in allowing high count values. You can also use this option to turn off down-sampling altogether, by setting the probability threshold to -1.

Transcript features

By default fivepseq filters genes from the annotation file that have mRNA or protein_coding tags. You may change it to your desired type available under ‘type’ or ‘biotype’ attributes in your annotation.

-transcript-type

Subset

Fivepseq will only perform analysis for the first genes, the number being equal to the argument passed to this option.

-subset

Plotting options

-triangle_threshold

Count threshold (0:10000) for points in the triangle plot (points with lower counts will not be plotted)

Running fivepseq on test data

To run fivepseq on test alignment files, download the dataset from here.

The test data folder also contains source files for running fivepseq. There is a script for a quick test (src/fivepseq_quick.sh in the test_data folder), where only a small subset of transcripts is chosen. This test does not produce reasonable plots, but if it runs successfully you will know that everything is set up properly. Navigate to the test dataset directory (yeast_eIF5A) and run the ./src/fivepseq_quick.sh script, or copy paste this code to the command line:

fivepseq \
   -g genome/* \
   -a gff/* \
   -b "preprocess/align_dedup/*.bam" \
   -o fivepseq_quick \
   -t yeast_quick_test \
   -gf genesets/test_set.txt

This should produce a folder named fivepseq_quick, with the following file structure:

fivepseq_quick
├── fivepseq_counts
│   ├── 1_eIF5A_mut_R1
│       ├── protein_coding
│           ├── text files explained in the next chapters
│   ├── 2_WT_R3
│       ├── protein_coding
│           ├── text files explained in the next chapters
│   ├── 3_eIF5A_mut_R1
│       ├── protein_coding
│           ├── text files explained in the next chapters
├── fivepseq_plots
│   ├── main
│       ├── png
│       ├── svg
│       ├── yeast_quick_test_combined.html
│       ├── yeast_quick_test_main.html
│   ├── supplement
│       ├── png
│       ├── svg
│       ├── yeast_quick_test_amino_acid_linecharts.html
│       ├── yeast_quick_test_codon_heatmaps.html
│       ├── yeast_quick_test_codon_linecharts.html
│       ├── yeast_quick_test_dicodon_linecharts.html
│       ├── yeast_quick_test_dipeptide_linecharts.html
│       ├── yeast_quick_test_tricodon_linecharts.html
│       ├── yeast_quick_test_tripeptide_linecharts.html
│   ├── comparison
│       ├── yeast_quick_test_differential_heatmaps.html
│   ├── yeast_quick_test.html   # <<-- this is the main entry point for all the reports
├── log
│   ├── fivepseq.log
├── count_summary.txt

The main result of the run is the file fivepseq_quick/fivepseq_plots/yeast_quick_test.html. This file has hyperlinks to all the other report files. The files generated here will be explained in detail in the Interpreting the output section.