.. _count_files:

***********************
Count files
***********************

=============================================
fivepseq_count directory listing
=============================================

Sometimes you might want to use the files that contain raw counts or stats in text format for alternative downstream applications.
Fivepseq stores the count files in the **fivepseq_counts** directory in the output folder.
Each sample has a default folder named **protein_coding**, where the count files and stats for all protein-coding genes are stored.
If fivepseq is run with the ``--geneset`` option enabled, count folders corresponding to each gene set will also appear alongside the **protein_coding** folder.
The files you should find inside each of these folders are listed below.

.. code-block:: shell

    fivepseq
    ├── fivepseq_counts
    │   ├── 1_eIF5A_mut_R1
    │   │   ├── protein_coding
    │   │   │   ├── amino_acid_pauses.txt
    │   │   │   ├── canonical_transcript_index.txt
    │   │   │   ├── codon_pauses.txt
    │   │   │   ├── count_distribution.txt
    │   │   │   ├── counts_FULL_LENGTH.txt
    │   │   │   ├── counts_START.txt
    │   │   │   ├── counts_TERM.txt
    │   │   │   ├── data_summary.txt
    │   │   │   ├── dicodon_pauses.txt
    │   │   │   ├── dipeptide_pauses.txt
    │   │   │   ├── fft_signals_start.txt
    │   │   │   ├── fft_signals_term.txt
    │   │   │   ├── fft_stats.txt
    │   │   │   ├── frame_counts_START.txt
    │   │   │   ├── frame_counts_TERM.txt
    │   │   │   ├── frame_stats.txt
    │   │   │   ├── meta_count_peaks_START.txt
    │   │   │   ├── meta_count_peaks_TERM.txt
    │   │   │   ├── meta_counts_START.txt
    │   │   │   ├── meta_counts_TERM.txt
    │   │   │   ├── outlier_lower.txt
    │   │   │   ├── outliers_df.txt
    │   │   │   ├── start_codon_dict.txt
    │   │   │   ├── term_codon_dict.txt
    │   │   │   ├── transcript_assembly.txt
    │   │   │   ├── transcript_descriptors.txt
    │   │   │   ├── transcript_fft.txt
    │   │   │   ├── transcript_frame_prefs.txt
    │   │   │   ├── tricodon_pauses.txt
    │   │   │   ├── tripeptide_pauses.txt
    │   ├── 2_WT_R3
    │   ├── 3_eIF5A_mut_R1
    ├── fivepseq_plots
    ├── log
    ├── count_summary.txt

data_summary
_____________________________

This file contains general statistics on the number of reads in the protein coding regions (or the geneset) used by
fivepseq.

.. code-block:: shell

    NumOfReads: Number of reads in the coding regions, accounting for transcript filters
    NumOfMapPositions: Number of genome positions with at least one mapping read (multiple reads at the same position are counted as one)
    NumOfReadsDownsampled: Number of reads in the coding regions after downsampling (noise removal)
    NumOfTranscripts: Number of transcripts in the given annotation for the given filter
    NumOfMapTranscripts: Number of transcripts with at least one mapping read
    MinNumOfReadsPerTranscript: Minimum number of reads per transcript
    MaxNumOfReadsPerTranscript: Maximum number of reads per transcript
    MedianNumOfReadsPerTranscript: Median number of reads per transcript

transcript_assembly
_____________________________

This file contains transcript descriptors: one row per transcript and six fields (columns).
Note that the transcript assembly differs for each gene set filter and only includes the transcripts in that filter.
Also, if the ``--genefilter`` option is given, only the filtered transcripts are included in the assembly.

.. code-block:: shell

    ID: transcript ID
    gene: gene ID (if found in the annotation)
    chr: chromosome name
    cds_start: start coordinate of the CDS
    cds_end: end coordinate of the CDS
    type: the transcript type (e.g. mRNA, tRNA, etc.)

transcript_descriptors
_____________________________

This file contains basic stats for each transcript.
The order of transcripts corresponds to that of transcript_assembly.txt.
Each transcript is described by seven fields:

.. code-block:: shell

    start: the codon at transcript CDS start
    stop: the codon at transcript CDS stop
    len: transcript length
    3nt: presence of 3-nucleotide periodicity in the transcript (1 for present, 0 for absent)
    NumOfReads: number of reads mapping to that transcript
    NumOfReadsDownsampled: number of reads mapping to that transcript after noise removal
    NumOfMapPositions: number of positions in the transcript with at least one mapped read (multiple reads per position are counted as one)

start_codon_dict.txt
_____________________________

This file lists all the codons present at the CDS start of the transcripts.
For each codon, the number of transcripts where it is present is indicated.
The codons are sorted in decreasing order of occurrence.

.. code-block:: shell

    ATG 6590
    TTG 1
    GCT 2
    AAC 1
    ATA 6

term_codon_dict.txt
_____________________________

This file lists all the codons present at the CDS stop of the transcripts.
For each codon, the number of transcripts where it is present is indicated.
The codons are sorted in decreasing order of occurrence.

.. code-block:: shell

    TAG 1511
    TGA 1951
    TAA 3138

canonical_transcript_index
_____________________________

This file lists the indices (as they appear in transcript_descriptors.txt) of transcripts that contain a canonical start codon (ATG) and a canonical termination codon (TAG, TGA or TAA).

count_distribution
_____________________________

This file describes the distribution of read 5' endpoint counts (only those greater than 0).

outlier_lower
_____________________________

This file contains the maximum count value that still falls within the Poisson distribution derived from the count distribution described above.
All counts greater than this value are down-scaled to it.

outliers_df
_____________________________

This file contains the genomic positions with counts greater than the maximum allowed count (indicated in *outlier_lower.txt*) and shows the down-scaled (downsampled) count for each of them.
Each such genomic position is described by the following fields:

.. code-block:: shell

    ID: transcript ID
    Name: gene name (if such an attribute exists in the annotation)
    position_from_start: the position relative to the transcript CDS start (first nucleotide = 0)
    position_from_term: the position relative to the transcript CDS stop (last nucleotide = 0)
    actual_count: the actual count of read 5' endpoints at the position
    downsampled_count: the count set at the position after noise removal

counts_FULL_LENGTH
_____________________________

Each row is a transcript, in the order of transcript_assembly.txt.
The counts correspond to the number of read 5' endpoints mapping to each position in the transcript.
The coordinates span the CDS plus 100 nucleotides upstream and 100 nucleotides downstream of it.

counts_START
_____________________________

Each row is a transcript, in the order of transcript_assembly.txt.
The counts correspond to the number of read 5' endpoints mapping to each position within the initial part of the transcript.
The coordinates run from -100 to +99 relative to the CDS start (first nucleotide = 0).

counts_TERM
_____________________________

Each row is a transcript, in the order of transcript_assembly.txt.
The counts correspond to the number of read 5' endpoints mapping to each position within the terminal part of the transcript.
The coordinates run from -97 to +102 relative to the CDS stop (first nucleotide of the STOP codon = 0).

meta_counts_START
_____________________________

This file contains cumulative counts for each position in the region -100 to +99 relative to the CDS start, summed over all the protein-coding genes (0 is the first nucleotide of the START codon).
There are two columns: the first holds the relative positions, the second the cumulative counts.
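
The relationship between the per-transcript counts and the meta-counts described above can be sketched in Python.
This is a minimal illustration, not fivepseq's own code: the function name ``meta_counts`` and the toy input are hypothetical, and the sketch assumes that meta-counts are plain position-wise sums over per-transcript count vectors of equal length.

```python
def meta_counts(count_rows, first_position=-100):
    """Sum per-transcript count vectors position by position.

    count_rows: list of equal-length lists, one per transcript
                (hypothetical in-memory form of a counts_START-like table).
    first_position: relative coordinate of the first column
                    (-100 for a START window, -97 for a TERM window).
    Returns a list of (relative_position, cumulative_count) pairs.
    """
    if not count_rows:
        return []
    width = len(count_rows[0])
    totals = [0] * width
    for row in count_rows:
        if len(row) != width:
            raise ValueError("all count vectors must have the same length")
        for i, value in enumerate(row):
            totals[i] += value
    return [(first_position + i, total) for i, total in enumerate(totals)]


# Toy example: two transcripts, a four-position window starting at -2.
rows = [[0, 2, 1, 0], [1, 0, 3, 0]]
meta = meta_counts(rows, first_position=-2)
for position, total in meta:
    print(position, total)
```

With real data, the rows would come from a per-transcript counts file and ``first_position`` would be set to match the window documented for that file.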

meta_counts_TERM
_____________________________

This file contains cumulative counts for each position in the region -97 to +102 relative to the CDS stop, summed over all the protein-coding genes (0 is the first nucleotide of the STOP codon).
There are two columns: the first holds the relative positions, the second the cumulative counts.

meta_count_peaks_START
_____________________________

This file provides significance p-values for the peaks observed in the meta-counts at the start.
Significance computations assume that the counts follow a Poisson distribution with lambda equal to the 0.84 quantile of the meta-counts (an approximation for the median of each peak in each codon).
The probability of a count falling into this distribution can be regarded as the significance p-value for the count being a peak.
The file lists, in sorted order, the counts that have p-values less than 0.01 and that are the highest points compared to their two neighboring positions.
There are four columns: (1) the index, (2) the relative positions of the counts, (3) the counts and (4) the p-values.

meta_count_peaks_TERM
_____________________________

This file provides significance p-values for the peaks observed in the meta-counts at the termination.
Significance computations assume that the counts follow a Poisson distribution with lambda equal to the 0.84 quantile of the meta-counts (an approximation for the median of each peak in each codon).
The probability of a count falling into this distribution can be regarded as the significance p-value for the count being a peak.
The file lists, in sorted order, the counts that have p-values less than 0.01 and that are the highest points compared to their two neighboring positions.
There are four columns: (1) the index, (2) the relative positions of the counts, (3) the counts and (4) the p-values.

frame_counts_START
_____________________________

This file contains the number of read 5' endpoints in each translation frame for each transcript.
The frame is counted relative to the CDS start.
After the index column there are three columns representing the frames: F0 (corresponds to the first nucleotide of the START codon), F1 and F2.
Each row is a transcript, ordered as in the **transcript_assembly.txt** file.

.. code-block:: shell

        F0  F1  F2
    0   0   0   0
    1   0   0   0
    2   0   0   0
    3   0   0   0

frame_counts_TERM
_____________________________

This file contains the number of read 5' endpoints in each translation frame for each transcript.
The frame is counted relative to the CDS stop.
After the index column there are three columns representing the frames: F0 (corresponds to the first nucleotide of the STOP codon), F1 and F2.
Each row is a transcript, ordered as in the **transcript_assembly.txt** file.

frame_stats
_____________________________

This file contains general statistics on the count frame preference for frames F0, F1 and F2 described above.
Each frame is described by the following fields:

.. code-block:: shell

    f_count: the raw counts at each frame
    f_perc: the fraction of counts falling into each frame
    fpi: frame protection index, computed as log2(2*F_i / (F_total - F_i))
    p_val_pair: the p-value of a t-test computed on the log2 ratio of transcript-wise counts between the frame and its neighboring frame to the right (F0 vs F1, F1 vs F2 or F2 vs F0)
    p_val_pair_max: the p-value of a t-test computed on the log2 ratio of transcript-wise counts between the frame and the other frame with the closest counts to it

transcript_frame_prefs
_____________________________

For each transcript, this file contains columns in the following order:

.. code-block:: shell

    count: the total counts at the transcript
    F: the frame of preference {0,1,2}
    f_count: the counts at the preferred frame
    f_perc: the percentage of those counts compared to the total counts at the transcript
    fpi: the frame protection index (FPI)

The FPI is computed as (log\ :sub:`2`\ (2F\ :sub:`i`\ /(F\ :sub:`total`\  - F\ :sub:`i`\ ))).

fft_stats.txt
_____________________________

This file contains information regarding the periodicity of read 5' endpoints across transcript coordinates.
Fivepseq applies the fast Fourier transform (FFT) to compute the presence and strength of periodic count waves.
The transcripts are aligned either at the start (prefix START) or at the end (prefix TERM) and truncated at the 0.75 quantile (75th percentile) of transcript lengths; shorter transcripts are padded with zeros.
The counts are then summed at each position across the transcripts, and the FFT is applied to the resulting metavector.
The file contains the following statistics for the top five periodicity waves (in columns named 0 to 4):

.. code-block:: shell

    START_periods: the periodicity, in number of nucleotides
    START_signals: the strength of the signal (the real component of the FFT)
    START_scales: the relative strength of the signal compared to background
    TERM_periods: the periodicity, in number of nucleotides
    TERM_signals: the strength of the signal (the real component of the FFT)
    TERM_scales: the relative strength of the signal compared to background

fft_signals_start
_____________________________

This file contains two columns with the START_period (periodicity) and START_signal (strength of the FFT signal at that periodicity) statistics described above.
All of the identified waves are included (as opposed to only the top five in the "fft_stats.txt" file).

fft_signals_term
_____________________________

This file contains two columns with the TERM_period (periodicity) and TERM_signal (strength of the FFT signal at that periodicity) statistics described above.
All of the identified waves are included (as opposed to only the top five in the "fft_stats.txt" file).

transcript_fft
_____________________________

This file contains FFT statistics for each transcript.
Each row is described by the following stats:

.. code-block:: shell

    count: the total count of read 5' endpoints at the transcript
    period: the periodicity at the strongest FFT wave
    signal: the strength of the FFT signal at that periodicity

amino_acid_pauses
_____________________________

Cumulative raw counts at a given distance (columns -50 to -1) from each amino acid (rows) are indicated.
The distance is counted from the first nucleotide of each amino acid, with the first nucleotide considered as position 0 (not in the file).

codon_pauses
_____________________________

Cumulative raw counts at a given distance (columns -30 to +5) from each codon (rows) are indicated.
The distance is counted from the first nucleotide of each codon, with the first nucleotide considered as position 0.

dicodon_pauses
_____________________________

Cumulative raw counts at a given distance (columns -27 to +5) from the top 50 dicodons (rows) are indicated.
The distance is counted from the first nucleotide of each dicodon, with the first nucleotide considered as position 0.
The dicodons are sorted from highest to lowest based on the counts at the -14 nt distance (if not changed at input) relative to the background counts in the range.

dipeptide_pauses
_____________________________

Cumulative raw counts at a given distance (columns -27 to +5) from the top 50 dipeptides (rows) are indicated.
The distance is counted from the first nucleotide of each dipeptide motif, with the first nucleotide considered as position 0.
The dipeptides are sorted from highest to lowest based on the counts at the -14 nt distance (if not changed at input) relative to the background counts in the range.
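
The ranking of motifs by their counts at one distance relative to background, as described for the dicodon and dipeptide tables above, can be sketched as follows.
This is only an illustration under stated assumptions: the exact background normalization used by fivepseq is not specified here, so the sketch assumes the background is the mean count across the whole row. The function name ``rank_motifs`` and the toy table are hypothetical.

```python
def rank_motifs(pause_table, positions, pause_position=-14, top=50):
    """Rank motifs by count at pause_position relative to row background.

    pause_table: dict mapping motif -> list of counts, one per position.
    positions: list of relative positions matching the count columns.
    Assumption: background is the mean count over the row; the real
    fivepseq normalization may differ.
    Returns up to `top` (motif, score) pairs, highest score first.
    """
    idx = positions.index(pause_position)
    scored = []
    for motif, counts in pause_table.items():
        background = sum(counts) / len(counts)
        # Rows with no counts at all get a score of 0 to avoid dividing by zero.
        score = counts[idx] / background if background > 0 else 0.0
        scored.append((motif, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


# Toy example with a three-position window around -14.
positions = [-15, -14, -13]
table = {
    "CCGCCG": [2, 2, 2],
    "AAAGAT": [1, 8, 1],
    "GGTTAA": [0, 0, 0],
}
ranked = rank_motifs(table, positions)
for motif, score in ranked:
    print(motif, round(score, 2))
```

For a real table the ``positions`` list would cover the documented column range, and ``pause_position`` would be adjusted if the default distance was changed at input.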

tricodon_pauses
_____________________________

Cumulative raw counts at a given distance (columns -24 to +8) from the top 50 tricodons (rows) are indicated.
The distance is counted from the first nucleotide of each tricodon, with the first nucleotide considered as position 0.
The tricodons are sorted from highest to lowest based on the counts at the -11 nt distance (if not changed at input) relative to the background counts in the range.

tripeptide_pauses
_____________________________

Cumulative raw counts at a given distance (columns -24 to +8) from the top 50 tripeptides (rows) are indicated.
The distance is counted from the first nucleotide of each tripeptide motif, with the first nucleotide considered as position 0.
The tripeptides are sorted from highest to lowest based on the counts at the -11 nt distance (if not changed at input) relative to the background counts in the range.