Count files

fivepseq_counts directory listing

You may want to use the raw counts and statistics, which are stored as plain-text files, in other downstream applications. Fivepseq stores these count files in the fivepseq_counts directory of the output folder. Each sample has a default folder named protein_coding, which holds the count files and stats for all protein-coding genes. If fivepseq is run with the --geneset option, a count folder for each gene set appears alongside the protein_coding folder. The files you should find inside each of these folders are listed below.

fivepseq
├── fivepseq_counts
│   ├── 1_eIF5A_mut_R1
│       ├── protein_coding
│           ├── amino_acid_pauses.txt
│           ├── canonical_transcript_index.txt
│           ├── codon_pauses.txt
│           ├── count_distribution.txt
│           ├── counts_FULL_LENGTH.txt
│           ├── counts_START.txt
│           ├── counts_TERM.txt
│           ├── data_summary.txt
│           ├── dicodon_pauses.txt
│           ├── dipeptide_pauses.txt
│           ├── fft_signals_start.txt
│           ├── fft_signals_term.txt
│           ├── fft_stats.txt
│           ├── frame_counts_START.txt
│           ├── frame_counts_TERM.txt
│           ├── frame_stats.txt
│           ├── meta_count_peaks_START.txt
│           ├── meta_count_peaks_TERM.txt
│           ├── meta_counts_START.txt
│           ├── meta_counts_TERM.txt
│           ├── outlier_lower.txt
│           ├── outliers_df.txt
│           ├── start_codon_dict.txt
│           ├── term_codon_dict.txt
│           ├── transcript_assembly.txt
│           ├── transcript_descriptors.txt
│           ├── transcript_fft.txt
│           ├── transcript_frame_prefs.txt
│           ├── tricodon_pauses.txt
│           ├── tripeptide_pauses.txt
│   ├── 2_WT_R3
│   ├── 3_eIF5A_mut_R1
├── fivepseq_plots
├── log
├── count_summary.txt
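
As a minimal sketch of how this layout could be traversed, the snippet below lists every sample folder under fivepseq_counts together with the gene-set folders inside it. The helper name list_count_folders is hypothetical (it is not part of fivepseq), and the demo builds a throwaway directory that only mimics the tree above.

```python
import tempfile
from pathlib import Path


def list_count_folders(output_dir):
    """Map each sample folder to the gene-set folders found inside it."""
    counts_dir = Path(output_dir) / "fivepseq_counts"
    folders = {}
    for sample_dir in sorted(p for p in counts_dir.iterdir() if p.is_dir()):
        folders[sample_dir.name] = sorted(
            p.name for p in sample_dir.iterdir() if p.is_dir()
        )
    return folders


# Demo on a temporary directory that imitates the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("1_eIF5A_mut_R1", "2_WT_R3"):
        (Path(tmp) / "fivepseq_counts" / sample / "protein_coding").mkdir(parents=True)
    demo_folders = list_count_folders(tmp)

print(demo_folders)
```

On a real run, pass the fivepseq output folder instead of the temporary directory; any gene-set folders created by --geneset would show up next to protein_coding.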

data_summary

This file contains general statistics on the number of reads in the protein coding regions (or the geneset) used by fivepseq.

NumOfReads:                     Number of reads in the coding regions, accounting for transcript filters
NumOfMapPositions:              Number of genome positions with at least one mapping read (multiple reads at the same position are counted as one)
NumOfReadsDownsampled:          Number of reads in the coding regions after downsampling (noise removal)
NumOfTranscripts:               Number of transcripts in the given annotation for the given filter
NumOfMapTranscripts:            Number of transcripts with at least one mapping read
MinNumOfReadsPerTranscript:     Minimum number of reads per transcript
MaxNumOfReadsPerTranscript:     Maximum number of reads per transcript
MedianNumOfReadsPerTranscript:  Median number of reads per transcript
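
The sketch below shows one way such a summary could be read into a dictionary. It assumes each line holds a field name followed by a single value, separated by a tab, whitespace or a colon; the exact on-disk layout is an assumption, so check a real data_summary.txt before relying on it. The function name parse_data_summary and the numbers in the example are made up for illustration.

```python
import re


def parse_data_summary(text):
    """Parse 'name <separator> value' lines into a dict (assumed layout)."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once on a run of colons/whitespace (covers tab, space, ': ').
        parts = re.split(r"[:\s]+", line, maxsplit=1)
        if len(parts) != 2:
            continue
        key, value = parts
        try:
            stats[key] = float(value)
        except ValueError:
            stats[key] = value
    return stats


# Made-up values, for illustration only.
example = (
    "NumOfReads\t120000\n"
    "NumOfMapPositions\t45000\n"
    "NumOfTranscripts\t6600\n"
    "NumOfMapTranscripts\t5900\n"
)
summary = parse_data_summary(example)
mapped_fraction = summary["NumOfMapTranscripts"] / summary["NumOfTranscripts"]
print(f"{mapped_fraction:.3f} of transcripts have at least one mapping read")
```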

transcript_assembly

This file contains transcript descriptors: a row per transcript and six fields (columns). Note that transcript assemblies for each gene set filter will be different and will only include transcripts in the filter. Also, if --genefilter option is given, only the filtered transcripts will be included in the assembly.

ID:         transcript ID
gene:       gene ID (if found in the annotation)
chr:        chromosome name
cds_start:  start coordinate of the CDS
cds_end:    end coordinate of the CDS
type:       the transcript type (e.g. mRNA, tRNA, etc.)

transcript_descriptors

This file contains basic stats for each transcript. The order of transcripts corresponds to that of transcript_assembly.txt. Each transcript is described by seven fields:

start:                      the codon at transcript CDS start
stop:                       the codon at transcript CDS stop
len:                        transcript length
3nt:                        presence of 3-nucleotide periodicity in the transcript (1 for present, 0 for absent)
NumOfReads:                 number of reads mapping to that transcript
NumOfReadsDownsampled:      number of reads mapping to that transcript after noise removal
NumOfMapPositions:          number of positions in the transcript with at least one mapped read (multiple reads per position are counted as one)

start_codon_dict

This file lists all the codons present at CDS start of the transcripts. For each codon, the number of transcripts where it is present is indicated. The codons are sorted in decreasing order of occurrence.

ATG     6590
TTG     1
GCT     2
AAC     1
ATA     6
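
A codon table of this kind can be sketched with a counter over the first codon of every CDS. The function name codon_dict and the toy input are hypothetical; this is not the fivepseq implementation.

```python
from collections import Counter


def codon_dict(codons):
    """Count codon occurrences, most frequent first."""
    return Counter(codons).most_common()


# Toy list of start codons, one per transcript (illustrative only).
start_codons = ["ATG", "ATG", "ATA", "ATG", "TTG", "ATA"]
table = codon_dict(start_codons)
for codon, n_transcripts in table:
    print(f"{codon}\t{n_transcripts}")
```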

term_codon_dict

This file lists all the codons present at CDS stop of the transcripts. For each codon, the number of transcripts where it is present is indicated. The codons are sorted in decreasing order of occurrence.

TAG     1511
TGA     1951
TAA     3138

canonical_transcript_index

This file lists the indices (as they appear in transcript_descriptors.txt) of transcripts that contain a canonical start codon (ATG) and a canonical termination codon (TAG, TGA or TAA).
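
The selection can be sketched as a filter over the start and stop columns of the transcript descriptors, with the start codon written in the DNA alphabet (ATG) as in the codon files above. The function name canonical_indices and the toy lists are assumptions made for illustration.

```python
CANONICAL_START = {"ATG"}
CANONICAL_STOP = {"TAG", "TGA", "TAA"}


def canonical_indices(start_codons, stop_codons):
    """Return row indices whose start and stop codons are both canonical."""
    return [
        i
        for i, (start, stop) in enumerate(zip(start_codons, stop_codons))
        if start.upper() in CANONICAL_START and stop.upper() in CANONICAL_STOP
    ]


# Toy descriptor columns, one entry per transcript.
starts = ["ATG", "TTG", "ATG", "ATG"]
stops = ["TAA", "TGA", "GCT", "TAG"]
canonical = canonical_indices(starts, stops)
print(canonical)
```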

count_distribution

This file describes the distribution of read 5’ endpoint counts (those greater than 0).

outlier_lower

This file contains the maximum count value that is still consistent with a Poisson distribution, given the count distribution described in count_distribution.txt. All counts greater than this value are down-scaled to it.

outliers_df

This file lists the genomic positions whose counts exceed the maximum allowed count (given in outlier_lower.txt), together with the down-scaled (downsampled) count assigned to each. Each such position is described by the following fields:

ID:                     transcript ID
Name:                   gene Name (if such attribute exists for the annotation)
position_from_start:    the relative position of the location from the transcript CDS start (first nucleotide = 0)
position_from_term:     the relative position of the location from the transcript CDS stop (last nucleotide = 0)
actual_count:           the actual count of read 5' endpoint at the position
downsampled_count:      the count set at the location after noise removal
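
The down-scaling step itself amounts to capping counts at the threshold. The sketch below assumes the threshold has already been read from outlier_lower.txt and shows only the capping and the bookkeeping of outliers; how the threshold is derived from the Poisson model is not covered. Function names and the toy data are hypothetical.

```python
def downscale_outliers(counts, max_allowed):
    """Cap every count at max_allowed; smaller counts are left untouched."""
    return [min(c, max_allowed) for c in counts]


def find_outliers(counts, max_allowed):
    """List (position, actual_count, downsampled_count) for capped positions."""
    return [
        (pos, c, max_allowed) for pos, c in enumerate(counts) if c > max_allowed
    ]


# Toy per-position counts for one transcript and an assumed threshold.
counts = [0, 3, 250, 7, 0, 41]
threshold = 40
capped = downscale_outliers(counts, threshold)
outliers = find_outliers(counts, threshold)
print(capped)
print(outliers)
```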

counts_FULL_LENGTH

Each row is a transcript in the order of transcript_assembly.txt. The counts correspond to the number of read 5’ endpoints mapping to each position in the transcript. The coordinates span the CDS and -100/+100 nucleotides around it.

counts_START

Each row is a transcript in the order of transcript_assembly.txt. The counts correspond to the number of read 5’ endpoints mapping to each position within the initial part of the transcript. The coordinates start from -100 to +99 from start (first nucleotide = 0).

counts_TERM

Each row is a transcript in the order of transcript_assembly.txt. The counts correspond to the number of read 5’ endpoints mapping to each position within the terminal part of the transcript. The coordinates start from -97 to +102 from stop (first nucleotide of the STOP codon = 0).

meta_counts_START

This file contains cumulative counts for each genomic position in the region -100 to +99 from the CDS start of all the protein-coding genes (0 is the first nucleotide of the START codon). There are two columns: the first holds the relative positions, the second the cumulative counts.

meta_counts_TERM

This file contains cumulative counts for each genomic position in the region -97 to +102 from the CDS stop of all the protein-coding genes (0 is the first nucleotide of the STOP codon). There are two columns: the first holds the relative positions, the second the cumulative counts.
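
The relation between the per-transcript count files and the meta-counts can be sketched as a column-wise sum. The snippet assumes each row of counts_START or counts_TERM has already been parsed into a list of numbers of equal length; the function name meta_counts and the short toy rows are illustrative.

```python
def meta_counts(count_rows, first_position):
    """Sum counts column-wise and pair each total with its relative position."""
    width = len(count_rows[0])
    totals = [0] * width
    for row in count_rows:
        if len(row) != width:
            raise ValueError("all rows must have the same number of positions")
        for j, count in enumerate(row):
            totals[j] += count
    return [(first_position + j, totals[j]) for j in range(width)]


# Toy rows covering positions -2..+2; real START rows would span -100..+99.
rows = [
    [0, 1, 4, 0, 2],
    [1, 0, 6, 0, 0],
    [0, 0, 3, 1, 1],
]
meta = meta_counts(rows, first_position=-2)
print(meta)
```

For the TERM file the same function would be called with first_position=-97.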

meta_count_peaks_START

This file provides significance p-values of the peaks observed in the meta-counts at the start. Significance computations assume that the counts follow a Poisson distribution with lambda equal to the 0.84 quantile of the meta-counts (an approximation for the median of each peak in each codon). The probability of a count falling into this distribution can be regarded as significance p-value for the count being a peak.

The file outputs sorted counts that have p-values less than 0.01 and that are the highest points compared to two neighboring positions.

There are four columns - (1) the index, (2) the count relative positions, (3) the counts and (4) the p-values.
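
The peak test described above can be sketched in plain Python. Several details are assumptions, not confirmed by the text: the p-value is taken as the upper tail P(X >= count), the 0.84 quantile uses linear interpolation, a peak must be strictly higher than both neighbors, and the output is sorted by p-value. The function names are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), summed term by term."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    total, i = 0.0, k
    while True:
        term = math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
        total += term
        # Past the mode the terms shrink; stop once they are negligible.
        if i > lam and term < 1e-17 * max(total, 1e-300):
            break
        i += 1
    return min(total, 1.0)


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def find_peaks(positions, counts, q=0.84, alpha=0.01):
    """Return (position, count, p_value) for significant local maxima."""
    lam = quantile(counts, q)
    peaks = []
    for i in range(1, len(counts) - 1):
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]:
            p = poisson_sf(counts[i], lam)
            if p < alpha:
                peaks.append((positions[i], counts[i], p))
    return sorted(peaks, key=lambda peak: (peak[2], peak[0]))


# Toy meta-counts around position 0 (illustrative values).
positions = list(range(-5, 5))
counts = [10, 12, 11, 9, 60, 10, 11, 12, 10, 11]
peaks = find_peaks(positions, counts)
print(peaks)
```

The same procedure applies to the termination meta-counts; only the input vector changes.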

meta_count_peaks_TERM

This file provides significance p-values of the peaks observed in the meta-counts at the termination. Significance computations assume that the counts follow a Poisson distribution with lambda equal to the 0.84 quantile of the meta-counts (an approximation for the median of each peak in each codon). The probability of a count falling into this distribution can be regarded as significance p-value for the count being a peak.

The file outputs sorted counts that have p-values less than 0.01 and that are the highest points compared to two neighboring positions.

There are four columns - (1) the index, (2) the count relative positions, (3) the counts and (4) the p-values.

frame_counts_START

This file contains the number of read 5’ endpoints in each translation frame for each transcript. The frame is counted relative to the CDS start. After the index column there are three columns representing the frames: F0 (corresponds to the first nucleotide of the START codon), F1 and F2. Each row is a transcript, ordered as in the transcript_assembly.txt file.

        F0      F1      F2
0       0       0       0
1       0       0       0
2       0       0       0
3       0       0       0

frame_counts_TERM

This file contains the number of read 5’ endpoints in each translation frame for each transcript. The frame is counted relative to the CDS stop. After the index column there are three columns representing the frames: F0 (corresponds to the first nucleotide of the STOP codon), F1 and F2. Each row is a transcript, ordered as in the transcript_assembly.txt file.
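
Frame assignment can be sketched as taking each position modulo 3 relative to a reference nucleotide. The snippet assumes a list of counts indexed from the first CDS nucleotide; the offset parameter, which moves the reference point (for example to the first nucleotide of the STOP codon), is a hypothetical addition for illustration.

```python
def frame_counts(cds_counts, offset=0):
    """Sum counts per frame; position `offset` defines frame F0."""
    frames = [0, 0, 0]
    for pos, count in enumerate(cds_counts):
        # Python's % is non-negative here, so positions before the
        # reference point are assigned to the correct frame as well.
        frames[(pos - offset) % 3] += count
    return frames


# Toy counts over a 9-nucleotide CDS (three codons).
counts = [5, 1, 0, 4, 2, 1, 6, 0, 0]
f_start = frame_counts(counts)           # relative to the START codon
f_term = frame_counts(counts, offset=6)  # relative to the last codon here
print(f_start, f_term)
```

When the CDS length is a multiple of three, both reference points give the same frames, as in this toy case.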

frame_stats

This file contains general statistics on the count frame preference for frames F0, F1 and F2 described above. Each frame is described by the following fields:

f_count:        the raw counts at each frame
f_perc:         the fraction of counts falling into each frame
fpi:            frame protection index, computed as log2(2Fi/(Ftotal - Fi))
p_val_pair:     the p value of t-test computed on the log2 ratio of transcript-wise counts between the frame and its neighboring frame to the right (F0 vs F1, F1 vs F2 or F2 vs F0)
p_val_pair_max: the p value of t-test computed on the log2 ratio of transcript-wise counts between the frame and another frame with closest counts to it

transcript_frame_prefs

For each transcript, this file contains columns in the following order:

count:          the total counts at the transcript
F:              the frame of preference {0,1,2}
f_count:        the counts at the preferred frame
f_perc:         the percentage of those counts compared to the total counts at the transcript
fpi:            the frame protection index (FPI)

The FPI is computed as log2(2Fi/(Ftotal - Fi)).
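
As a worked sketch of the formula, the snippet below computes the FPI for each frame and picks the preferred frame as the one with the most counts. The handling of zero counts (returning infinities or NaN) is an assumption made here, not something the text specifies, and the function names are hypothetical.

```python
import math


def fpi(frame_counts, i):
    """Frame protection index: log2(2*F_i / (F_total - F_i))."""
    f_i = frame_counts[i]
    rest = sum(frame_counts) - f_i
    if f_i == 0 and rest == 0:
        return math.nan   # no counts at all (assumed convention)
    if rest == 0:
        return math.inf   # all counts fall in frame i (assumed convention)
    if f_i == 0:
        return -math.inf  # no counts in frame i (assumed convention)
    return math.log2(2 * f_i / rest)


def preferred_frame(frame_counts):
    """Return (frame, count, fraction of total, fpi) for the top frame."""
    frame = max(range(3), key=lambda i: frame_counts[i])
    total = sum(frame_counts)
    fraction = frame_counts[frame] / total if total else math.nan
    return frame, frame_counts[frame], fraction, fpi(frame_counts, frame)


counts = [15, 3, 1]
pref = preferred_frame(counts)
print(pref)  # frame 0 holds 15 of 19 counts; FPI = log2(30/4)
```

The factor 2 makes the index 0 when a frame holds exactly one third of the counts, which is the expectation without any frame preference.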

fft_stats

This file contains information on the periodicity of read 5’ endpoints across transcript coordinates. Fivepseq applies a fast Fourier transform (FFT) to detect the presence and strength of periodic count waves. The transcripts are aligned either at the start (prefix START) or at the end (prefix TERM) and truncated at the 0.75 quantile (75th percentile) of transcript lengths; shorter transcripts are padded with zeros. The counts are then summed at each position across the transcripts, and the FFT is applied to the resulting metavector.

The file contains the following statistics for the top five periodicity waves (in columns named 0 to 4):

START_periods:      the periodicity, in nucleotides
START_signals:      the strength of the signal (the real component of the FFT)
START_scales:       the relative strength of the signal compared to background
TERM_periods:       the periodicity, in nucleotides
TERM_signals:       the strength of the signal (the real component of the FFT)
TERM_scales:        the relative strength of the signal compared to background
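
The metavector construction and the periodicity scan can be sketched with the standard library alone. To stay dependency-free, the snippet uses a plain discrete Fourier transform, which yields the same coefficients as an FFT but more slowly. The quantile rule for the cutoff, the conversion of a frequency index k to a period n/k, and the function names are assumptions made for illustration.

```python
import cmath
import math


def metavector(count_vectors, align="start", q=0.75):
    """Align, truncate at the q-quantile of lengths, zero-pad and sum."""
    lengths = sorted(len(v) for v in count_vectors)
    cutoff = lengths[int(q * (len(lengths) - 1))]  # assumed quantile rule
    meta = [0] * cutoff
    for v in count_vectors:
        if align == "start":
            part = list(v[:cutoff])
            part += [0] * (cutoff - len(part))        # pad at the 3' side
        else:
            part = list(v[-cutoff:])
            part = [0] * (cutoff - len(part)) + part  # pad at the 5' side
        for j, count in enumerate(part):
            meta[j] += count
    return meta


def dft_waves(meta):
    """Return (period, real part of the DFT coefficient) for k = 1..n/2."""
    n = len(meta)
    waves = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(meta)
        )
        waves.append((n / k, coeff.real))
    return waves


# Toy transcripts with a count at the first nucleotide of every codon.
vectors = [[3, 0, 0] * 4, [3, 0, 0] * 3, [3, 0, 0] * 4, [3, 0, 0] * 2]
meta = metavector(vectors, align="start")
top_wave = max(dft_waves(meta), key=lambda wave: wave[1])
print(meta)
print(top_wave)  # the strongest wave has a period of 3 nucleotides
```

On real data a library FFT would be the practical choice; the naive transform is used here only to keep the sketch self-contained.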

fft_signals_start

This file contains two columns with the START_period (periodicity) and START_signal (strength of the FFT signal at that periodicity) statistics described above. All of the identified waves are included (as opposed to top five only in the “fft_stats.txt” file).

fft_signals_term

This file contains two columns with the TERM_period (periodicity) and TERM_signal (strength of the FFT signal at that periodicity) statistics described above. All of the identified waves are included (as opposed to top five only in the “fft_stats.txt” file).

transcript_fft

This file contains FFT statistics for each transcript. Each row is described by the following stats:

count:      the total count of read 5' endpoints at the transcript
period:     the periodicity at the strongest FFT wave
signal:     the strength of the FFT signal at that periodicity

amino_acid_pauses

Cumulative raw counts at certain distance (columns -50 to -1) from each amino acid (rows) are indicated. The distance is counted from the first nucleotide of each amino acid, with the first nucleotide considered as position 0 (not in the file).

codon_pauses

Cumulative raw counts at certain distance (columns -30 to +5) from each codon (rows) are indicated. The distance is counted from the first nucleotide of each codon, with the first nucleotide considered as position 0.
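
The accumulation behind such a table can be sketched for a single transcript as follows. The snippet assumes an in-frame CDS sequence and a count vector covering exactly the same nucleotides, and it simply skips distances that fall outside the vector; how flanking regions are treated in the real files is not covered. The function name codon_pauses and the toy data are hypothetical.

```python
def codon_pauses(sequence, counts, d_min=-30, d_max=5):
    """Sum counts at each distance d_min..d_max from every codon occurrence."""
    width = d_max - d_min + 1
    table = {}
    for p in range(0, len(sequence) - 2, 3):  # p = first nucleotide of a codon
        codon = sequence[p:p + 3]
        row = table.setdefault(codon, [0] * width)
        for d in range(d_min, d_max + 1):
            pos = p + d
            if 0 <= pos < len(counts):
                row[d - d_min] += counts[pos]
    return table


# Toy CDS (ATG AAA AAA TAA) with a single pile-up of 5 reads at position 4.
sequence = "ATGAAAAAATAA"
counts = [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0]
table = codon_pauses(sequence, counts, d_min=-3, d_max=2)
print(table["AAA"])  # columns correspond to distances -3, -2, -1, 0, +1, +2
```

Summing such tables over all transcripts gives one row per codon, and the amino acid table follows by merging the rows of synonymous codons.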

dicodon_pauses

Cumulative raw counts at certain distance (columns -27 to +5) from the top 50 dicodons (rows) are indicated. The distance is counted from the first nucleotide of each dicodon, with the first nucleotide considered as position 0. The dicodons are sorted from highest to lowest based on counts in the -14nt distance (if not changed at input) relative to the background counts in the range.

dipeptide_pauses

Cumulative raw counts at certain distance (columns -27 to +5) from the top 50 dipeptides (rows) are indicated. The distance is counted from the first nucleotide of each dipeptide motif, with the first nucleotide considered as position 0. The dipeptides are sorted from highest to lowest based on counts in the -14nt distance (if not changed at input) relative to the background counts in the range.

tricodon_pauses

Cumulative raw counts at certain distance (columns -24 to +8) from the top 50 tricodons (rows) are indicated. The distance is counted from the first nucleotide of each tricodon, with the first nucleotide considered as position 0. The tricodons are sorted from highest to lowest based on counts in the -11nt distance (if not changed at input) relative to the background counts in the range.

tripeptide_pauses

Cumulative raw counts at certain distance (columns -24 to +8) from the top 50 tripeptides (rows) are indicated. The distance is counted from the first nucleotide of each tripeptide motif, with the first nucleotide considered as position 0. The tripeptides are sorted from highest to lowest based on counts in the -11nt distance (if not changed at input) relative to the background counts in the range.