Telofinder

A python package to determine the location and the size of telomeric repeats (both terminal and internal) from genome assemblies. This version of the program only works for yeast genomes (telomere sequence TG1-3)

Telomere detection is based on calculation in a 20 bp sliding window of the following two metrics:
  • DNA sequence entropy < entropy_threshold (default=0.8)

  • proportion of polynucleotides (default_list = ["CC", "CA", "AC"]) > polynuc_threshold (default=0.8)

_images/polynuc_entropy_8kb.png _images/zoom_term.png _images/zoom_int.png

Installation

You need to have bedtools in your PATH for telofinder to work.

Install (mini)conda:

Configure channels and add dependencies:

$ conda config --add channels conda-forge bioconda
$ conda create -n telofinder python=3.10
$ conda activate telofinder
$ conda install bedtools

install the telofinder package:

$ git clone https://github.com/GillesFischerSorbonne/telofinder.git
$ cd telofinder
$ pip install .

Usage

Using the conda environment created during Installation:

$ conda activate telofinder
$ telofinder [fasta_path]

Positional arguments

fasta_path

path to a single (multi)fasta file or to a directory containing multiple fasta files

Optional arguments

-f, --force

automatically replace the output 'telofinder_results' directory if present

-e, --entropy_threshold

entropy threshold for telomere prediction, default = 0.8

-n, --polynuc_threshold

poly-nucleotide threshold for telomere prediction, default = 0.8

-s, --nb_scanned_nt

total number of nucleotides scanned for telomere detection, starting from each chromosome extremity. If set to -1, the whole chromosome sequences will be scanned, default = 20 000 bp

-t, --threads

number of threads to use, default = 1

-r, --raw

outputs the raw dataframe (raw_df.csv) containing the values of all sliding windows

Help

$ python telofinder.py --help

Output

Telofinder outputs a directory called telofinder_results including 2 csv and 2 bed files containing the telomere calls and their coordinates, either as raw output or after merging consecutive calls

Reference

base_compos(sequence, base)[source]

Counts the number of a given base in a sequence

Parameters:
  • sequence -- fasta sequence

  • base -- base to count in the sequence

Returns:

the number of that base in the sequence

classify_telomere(interval_chrom, chrom_len)[source]

From a list of tuples obtained from get_consecutive_groups, identify if interval corresponds to terminal or interal telomere

compute_metrics(window, polynucleotide_list=['AC', 'CA', 'CC'])[source]

Compute entropy and polynucleotide proportion in the sequence window

Parameters:
  • window -- sliding window

  • polynucleotide_list -- a list of polynucleotides, default value is ["AC", "CA", "CC"]

Returns:

a dictionary of entropy and polynucleotide proportion of the sequence window

count_polynuc_occurence(sub_window, polynucleotide_list)[source]

Define presence of polynucleotide in the window.

Parameters:
  • sub_window -- the sequence of a sub_window

  • polynuleotide_list -- a list of polynucleotides. Note that all polynucleotides must be of the same size

Returns:

a boolean for the presence of the sub_window in the polynucleotide list

export_results(raw_df, telom_df, merged_telom_df, raw, outdir)[source]

Produce output table files

get_consecutive_groups(df_chrom)[source]

From the raw dataframe get start and end of each telomere window. Applied to detect start and end of telomere in nucleotide positions.

get_entropy(window)[source]

Calculate the entropy of the window DNA sequence

Parameters:

window -- sliding window

Returns:

entropy value of the sequence window

get_polynuc(window, polynucleotide_list)[source]

get the propbortion of polynuceotides in the window

Parameters:
  • window -- sliding window

  • polynucleotide_list -- a list of polynucleotides. Note that all polynucleotides must be of the same size

Returns:

total polynucleotide proportion in the sliding window

get_strain_name(filename)[source]

Function to get the strain name from the name of the fasta file

Parameters:

filename -- path of fasta file

Returns:

sequence name

get_telomere_in_telomeric_reads(bam_file, telo_df_merged, outdir, polynuc_thres, entropy_thres, nb_scanned_nt, threads)[source]

Extract telomeric reads from a bam file corresponding to telomere detected and reported in telo_df_merged

Parameters:
  • bam_file -- An indexed bam alignment file.

  • telo_df_merged -- Merged DataFrame with telomeric informations (from one of the run_telofinder functions)

output_dir_exists(force)[source]

Function to test if the 'telofinder_results' output diretory already exists.

Parameters:

force -- overwrites the output directory otherwise exits program --force or -f

Returns:

nothing

parse_arguments()[source]

Function to parse and reuse the arguments of the command line

Parameters:
  • fasta_path -- path to a single fasta file or to a directory containing multiple fasta files

  • force -- force optional, overwrites the output directory otherwise exits program, optional

  • entropy_threshold -- optional, default = 0.8

  • polynuc_threshold -- optional, default = 0.8

  • nb_scanned_nt -- number of scanned nucleotides at each chromosome end, optional, default = 20 000

  • threads -- Number of threads to use. Multithreaded calculations currently occurs at the level of sequences within a fasta file."

  • raw -- Outputs raw_df.csv containing the values of all sliding windows

Returns:

parser arguments

run_on_fasta_dir(fasta_dir_path, polynuc_thres, entropy_thres, nb_scanned_nt, direction, threads)[source]

Run iteratively the telemore detection algorithm on all fasta files in a directory

Parameters:

fasta_dir -- path to fasta directory

Returns:

a tuple of df, telo_df and telo_df_merged

run_on_single_fasta(fasta_path, polynuc_thres, entropy_thres, nb_scanned_nt, direction, threads)[source]

Run the telomere detection algorithm on a single fasta file

Parameters:

fasta_path -- path to fasta file

Returns:

a tuple of df, telo_df and telo_df_merged

run_on_single_seq(seq_record, strain, polynuc_thres, entropy_thres, nb_scanned_nt, direction)[source]
run_telofinder(fasta_path, polynuc_thres, entropy_thres, nb_scanned_nt, direction, threads, raw, outdir)[source]

Run telofinder on a single fasta file or on a fasta directory

sliding_window(sequence, start, end, size)[source]

Apply a sliding window of length = size to a sequence from start to end

Parameters:
  • sequence -- fasta sequence

  • start -- starting coordinate of the sequence

  • end -- ending coordinate of the sequence

  • size -- size of the sliding window

Returns:

the coordinate and the sequence of the window

Indices and tables