smeagol package

Submodules

smeagol.aggregate module

smeagol.aggregate.collect_data(file_pattern, file_ids, col_with_index, col_to_collect)

This function is used to aggregate multiple SMEAGOL result files, for example on different genomes. It collects a specific column (col_to_collect) from all the result files and merges these columns into a single dataframe. The merge is based on a specific index column (col_with_index) that should be present in all files.

Parameters:
  • file_pattern (str) – pattern to search for in filenames

  • file_ids (list) – list of IDs, one per file

  • col_with_index (str) – Name of the column to set as index

  • col_to_collect (str) – Name of the column to collect

Returns:

A concatenated dataframe containing the

col_to_collect column from all files matching the pattern file_pattern.

Return type:

df_all (pandas DataFrame)

smeagol.encode module

class smeagol.encode.SeqEncoding(records, rcomp='none', sense=None)

Bases: object

This class encodes a single nucleic acid sequence, or a set of sequences, all of which must have the same length. Genomic sequences used to initialize the class must also have the same sense.

len

Length of the sequences

Type:

int

ids

Numpy array containing the IDs of all sequences.

Type:

np.array

names

Numpy array containing the names of all sequences.

Type:

np.array

seqs

Numpy array containing the integer encoded sequences.

Type:

np.array

senses

Numpy array containing the senses (‘+’ or ‘-’) of all sequences.

Type:

np.array

class smeagol.encode.SeqGroups(records, rcomp='none', sense=None, group_by=None)

Bases: object

This class encodes one or more groups of equal-length nucleic acid sequences. Sequences in different groups may have different lengths.

smeagol.encode.integer_encode(seq, rc=False)

Function to encode a nucleic acid sequence as a sequence of integers.

Parameters:
  • seq (str, Seq or SeqRecord object) – Nucleic acid sequence to encode.

  • A (Allowed characters are) –

  • C

  • G

  • T

  • U

  • Z

  • N

  • W

  • S

  • M

  • K

  • R

  • Y

  • B

  • D

  • H

  • V.

  • rc (bool) – If True, reverse complement the sequence before encoding

Returns:

Numpy array containing the integer encoded sequence. The shape is (L, ) where L is the length of the sequence.

Return type:

result (np.array)

smeagol.encode.one_hot_encode(seq, rc=False)

Function to one-hot encode a nucleic acid sequence.

Parameters:
  • seq (str, Seq or SeqRecord object) – Nucleic acid sequence to encode.

  • A (Allowed characters are) –

  • C

  • G

  • T

  • U

  • Z

  • N

  • W

  • S

  • M

  • K

  • R

  • Y

  • B

  • D

  • H

  • V.

  • rc (bool) – If True, reverse complement the sequence before encoding

Returns:

Numpy array containing the one-hot encoded sequence. The shape is (L, 4) where L is the length of the sequence.

Return type:

result (np.array)

smeagol.enrich module

smeagol.enrich.enrich_in_genome(records, model, rcomp, threshold, simN=1000, simK=2, sense='+', background='binomial', verbose=False, combine_seqs=True, seq_batch=0, shuf_seqs=None)

Function to calculate enrichment of motifs in given sequence(s) relative to a background. The background sequences can be supplied directly or generated by shuffling the given sequence(s).

Parameters:
  • records (list) – A list of seqrecord objects containing the sequence(s) to test. This list can be generated by reading a fasta file using ‘smeagol.io.read_fasta’.

  • model (PWMModel) – An object of class PWMModel for PWM scanning

  • rcomp (str) – ‘only’ to encode the reverse complements of the given sequences, ‘both’ to encode the reverse complements as well as the given sequences, or ‘none’ to use only the given sequences without reverse complementing.

  • threshold (float) – Threshold motif match score to use for PWM scanning. A sub-sequence that has a PWM match score higher than this will be counted as a match.

  • simN (int) – The number of shuffled sequences to generate for the background model.

  • simK (int) – The k-mer frequency to conserve while shuffling

  • sense (str) – Sense of the given sequences. Must be ‘+’ or ‘-’

  • background (str) – Background model for calculating p-values. Must be ‘binomial’ or ‘normal’

  • combine_seqs (bool) – If true, combine outputs for all supplied sequences into a single output dataframe

  • seq_batch (int) – The number of shuffled sequences to scan at a time. If 0, scan all at once.

  • shuf_seqs (str/list) – A path to FASTA file or list of seqrecord objects containing shuffled sequences. Use if loading the background sequences directly instead of generating them by shuffling.

Returns:

dictionary containing results. Contains the following

keys: ‘enrichment’ (enrichment statistics), ‘real_sites’ (locations of motif matches on the given sequences), ‘real_counts’ (number of sites for each PWM on the given sequences), ‘shuf_counts’ (number of sites for each PWM on the background sequences), ‘shuf_stats’ (mean and standard deviation of the number of sites for each PWM on the background sequences).

Return type:

results (dict)

smeagol.enrich.enrich_in_sliding_windows(sites, genome, width, shift=None, matrix_id=None, alternative='two-sided')

Function to test for enrichment of motifs in sliding windows across a genome.

Parameters:
  • sites (pd.DataFrame) – A pandas dataframe containing the locations of motifs in the genome. Should contain columns id (the sequence ID), start and end.

  • genome (list) – A list of SeqRecord objects.

  • width (int) – The width (in bases) of tiling windows to test.

  • shift (int) – The shift (in bases) between successive windows. If not supplied, non-overlapping windows are generated.

  • matrix_id (str) – Optional. The ID of the PWM to test. If supplied, it should correspond to an entry in the column Matrix_id of sites.

  • alternative (str) – Defines the alternative hypothesis. Options are ‘two-sided’, ‘less’ or ‘greater’. Default is ‘two-sided’.

Returns:

A Pandas DataFrame containing the results of enrichment analysis in windows covering all sequences in genome

smeagol.enrich.enrich_in_window(window, sites, genome, matrix_id=None, alternative='two-sided')

Function to test for enrichment of motifs in a subsequence relative to a complete sequence. This function uses Fisher’s exact test.

Parameters:
  • window (pd.DataFrame) – A Pandas dataframe with the following columns: id (sequence ID), start, end

  • sites (pd.DataFrame) – A pandas dataframe containing the locations of motifs in the genome. Should contain columns id (sequence ID, start, end.

  • genome (list) – A list of SeqRecord objects

  • matrix_id (str) – Optional. The ID of the selected PWM. If supplied, it must correspond to an entry in the column ‘Matrix_id’ of sites.

  • alternative (str) – Defines the alternative hypothesis. Options are ‘two-sided’, ‘less’ or ‘greater’. Default is ‘two-sided’.

Returns:

Pandas dataframe containing the result of

enrichment analysis in selected window.

Return type:

result (pd.DataFrame)

smeagol.enrich.examine_thresholds(records, model, simN=300, simK=2, rcomp='only', min_threshold=0.5, threshold_shift=0.1, sense='+', verbose=False, combine_seqs=True)

Function to scan given sequence(s) with PWMs and calculate the number of binding sites predicted using different cutoffs for the match score, for the real sequence as well as shuffled sequences.

Parameters:
  • records (list) – A list of seqrecord objects containing the sequence(s) to scan.

  • model (PWMModel) – An object of class PWMModel for PWM scanning

  • simN (int) – The number of shuffled sequences to generate for the background model. Default is 300.

  • simK (int) – The k-mer frequency to conserve while shuffling. Default is 2.

  • rcomp (str) – ‘only’ to encode the sequence reverse complements, ‘both’ to encode the reverse complements as well as original sequences, or ‘none’. Default is ‘only’.

  • min_threshold (float) – Minimum threshold motif match score to use for PWM scanning. Default is 0.5.

  • threshold_shift (float) – Difference between successive thresholds to compare. Default is 0.1.

  • sense (str) – Sense of the given sequences. Must be ‘+’ or ‘-’. Default is ‘+’.

  • verbose (bool) – If true, output all information

  • combine_seqs (bool) – If true, combine outputs for all supplied sequences into a single output dataframe

Returns:

dictionary containing results. Contains the following

keys: ‘real_binned’ (Number of motif matches for each PWM in the given sequences, binned by their match score), ‘shuf_binned’ (Number of motif matches for each PWM in the background sequences, binned by their match score)

Return type:

results (dict)

smeagol.io module

smeagol.io.load_attract()

Function to download all motifs and metadata from ATtRACT and load them as a pandas DF.

Returns:

contains matrices

Return type:

df (pandas df)

smeagol.io.load_rbpdb(species='H.sapiens', matrix_type='PWM')

Function to download all motifs and metadata from RBPDB and load them as a pandas DF. Downloads version 1.3.1.

Parameters:
  • species (str) – ‘H.sapiens’, ‘M.musculus’, ‘D.melanogaster’ or ‘C.elegans’.

  • matrix_type (str) – ‘PWM’ or ‘PFM’

Returns:

contains matrices

Return type:

df (pandas df)

smeagol.io.load_smeagol_PWMset(dataset='representative')

Function to download the curated set of motifs used in the SMEAGOL paper and load them as a pandas dataframe.

Parameters:

dataset – ‘full’ or ‘representative’.

Returns:

contains matrices

Return type:

df (pandas df)

smeagol.io.read_fasta(file)

Function to read sequences from a fasta or fasta.gz file.

Parameters:

file (str) – path to file

Returns:

list of seqRecord objects, one for each sequence in the fasta file.

Return type:

records (list)

smeagol.io.read_pms_from_dir(dirname, matrix_type='PPM', transpose=False)

Function to read position matrices from a directory with separate files for each PM.

The input directory is expected to contain individual files each of which represents a separate matrix.

The file name should be the name of the PWM followed by an extension. If _ is included in the name, the characters after the _ will be dropped.

Matrices are by default expected to be in the position x base format, i.e. one row per position and one column per base (A, C, G, T). If the matrices are instead in the base x position format, set transpose=True.

Parameters:
  • dirname (str) – path to the folder containing PMs in individual files

  • matrix_type (str) – ‘PWM’, ‘PPM’ (default) or ‘PFM’

  • transpose (bool) – If true, transpose the matrix

Returns:

pandas dataframe containing PMs

smeagol.io.read_pms_from_file(file, matrix_type='PPM', check_lens=False, transpose=False, delimiter='\t')

Function to read position matrices from a FASTA-like file.

The input file is expected to follow the format:

>Matrix_1_ID matrix values >Matrix_2_ID matrix values

Matrices are by default expected to be in the position x base format, i.e. one row per position and one column per base (A, C, G, T). If the matrices are instead in the base x position format, set transpose=True.

Optionally, the ID rows may also contain a second field indicating the length of the matrix, for example:

>Matrix_1_ID<tab>7

The pwm.txt file downloaded from the Attract database follows this format. In this case, you can set check_lens=True to confirm that the loaded PWMs match the expected lengths.

Parameters:
  • pm_file (str) – path to the file containing PMs

  • matrix_type (str) – ‘PWM’, ‘PPM’ (default) or ‘PFM’

  • check_lens (bool) – check that the matrix lengths match the lengths provided in the file

  • transpose (bool) – transpose the matrices

  • delimiter (str) – The string used to separate values in the file

Returns:

pandas dataframe containing PMs

smeagol.io.write_fasta(records, file, gz=True)

Function to write sequences to a fasta or fasta.gz file.

Params:

records (list): list of seqRecord objects to write. file (str): path to file gz (bool): If true, compress the output file with gzip.

Returns:

Writes records to the file

smeagol.matrices module

smeagol.matrices.align_pms(X, Y, min_overlap=None)

Function to generate all possible overlaps between two position matrices.

Inputs:

X, Y (np.array): two position matrices. min_overlap (int): minimum overlap allowed between matrices

Returns:

tuples containing matrix start and end positions

corresponding to each possible alignment of the two matrices.

Return type:

result

smeagol.matrices.check_pfm(freqs, warn=False)

Function to check the validity of a PFM. A valid PFM should have non-negative integer values and exactly 4 columns.

Parameters:

freqs (np.array) – Numpy array containing frequency values. The array should be 2-dimensional and have 4 columns.

Raises:

ValueError – if freqs is an invalid PFM.

smeagol.matrices.check_ppm(probs, warn=False, eps=0.01)

Function to check the validity of a PPM. A valid PPM should have all values within the range [0,1], all rows summing to 1 (within a tolerance of eps), and exactly 4 columns.

Parameters:

probs (np.array) – Numpy array containing probability values. The array should be 2-dimensional and have 4 columns.

Raises:

ValueError – if probs is an invalid PPM.

smeagol.matrices.check_pwm(weights)

Function to check the validity of a PWM. A valid PWM should have exactly 4 columns.

Parameters:

weights (np.array) – Numpy array containing PWM weights. The array should be 2-dimensional and have 4 columns.

Raises:

ValueError – if weights is an invalid PWM.

smeagol.matrices.choose_cluster_representative_pms(df, sims=None, clusters=None, maximize='median', weight_col='weights', matrix_type='PWM')

Function to choose a representative position matrix from each cluster. This performs choose_representative_pm to choose a representative matrix within each supplied cluster.

Parameters:
  • df (pandas df) – Dataframe containing position matrix values and IDs.

  • sims (np.array) – pairwise similarities between all PWMs in pwms

  • clusters (list) – cluster assignments for each PWM.

  • maximize (str) – ‘mean’ or ‘median’. Metric to choose representative matrix.

  • weight_col (str) – column in pwms that contains matrix values.

  • matrix_type (str) – PPM or PWM

Returns:

IDs for the selected representative matrices.

Return type:

representatives (list)

smeagol.matrices.choose_representative_pm(df, sims=None, maximize='median', weight_col='weights', matrix_type='PWM')

Function to choose a representative position matrix from a group. If 2 matrices are supplied, the one with lower entropy is chosen. For larger sets of matrices, the matrix with the highest mean (or median) normalized Pearson correlation to all other matrices is chosen.

Parameters:
  • df (pandas df) – Dataframe containing position matrix values and IDs. The ID column should be named ‘Matrix_id’.

  • sims (np.array) – pairwise similarities between all PWMs in df

  • maximize (str) – ‘mean’ or ‘median’. Metric to choose representative matrix.

  • weight_col (str) – the column in df that contains matrix values.

  • matrix_type (str) – PPM or PWM

Returns:

IDs for the selected representative matrices.

Return type:

result (list)

smeagol.matrices.cluster_pms(df, n_clusters, sims=None, weight_col='weights')

Function to cluster position matrices. A distance matrix between the matrices is computed using the normalized Pearson correlation metric and agglomerative clustering is used to find clusters. choose_representative_pm is called to identify a representative matrix from each cluster.

Parameters:
  • df (pandas df) – Dataframe containing position matrix values and IDs.

  • n_clusters (int) – Number of clusters

  • sims (np.array) – pairwise similarities between all matrices in pwms

  • weight_col (str) – column in pwms that contains matrix values.

Returns:

dictionary containing cluster labels, representative

matrix IDs, and minimum pairwise similarity within each cluster

Return type:

result

smeagol.matrices.cos_sim(a, b)

Function to calculate the cosine similarity between two vectors.

Parameters:
  • a (np.array) – 1-D arrays.

  • b (np.array) – 1-D arrays.

Returns:

cosine similarity between a and b.

Return type:

result (float)

smeagol.matrices.entropy(probs)

Function to calculate the entropy of a PPM or column of a PPM. Entropy is calculated using the formula H = -sum(p*log2(p)).

Parameters:

probs (np.array) – Numpy array containing probability values.

Returns:

Entropy value

Return type:

result (float)

smeagol.matrices.matrix_correlation(X, Y)

Function to calculate the correlation between two equal-sized matrices. The values in each matrix are unrolled into a 1-D array and the Pearson correlation of the two 1-D arrays is returned.

Parameters:
  • X (np.array) – two numpy arrays with same shape

  • Y (np.array) – two numpy arrays with same shape

Returns:

Pearson correlation between X and Y

(unrolled into 1-D arrays)

Return type:

corr (float)

Raises:

ValueError – if X and Y do not have equal shapes.

smeagol.matrices.ncorr(X, Y, min_overlap=None)

Function to calculate the normalized Pearson correlation between two position matrices, as defined in doi: 10.1093/nar/gkx314.

Inputs:

X, Y (np.array): two position matrices. min_overlap (int): minimum overlap allowed between matrices

Returns:

normalized Pearson correlation value

Return type:

result (float)

smeagol.matrices.normalize_pm(pm)

Function to normalize a position matrix so that all rows sum to 1.

Parameters:

pm (np.array) – Numpy array containing probability values

Returns:

Numpy array with all rows normalized to sum to 1.

smeagol.matrices.order_pms(X, Y)

Function to order two matrices by width.

Inputs:

X, Y (np.array): two position matrices.

Returns:

two position matrices ordered by width, X being smaller.

Return type:

X, Y (np.array)

smeagol.matrices.pairwise_ncorrs(mats)

Function to calculate all pairwise normalized Pearson correlations between a list of position matrices.

Parameters:

mats (list) – a list of position matrices.

Returns:

All pairwise normalized Pearson correlations between matrices in mats.

Return type:

ncorrs (np.array)

smeagol.matrices.pfm_to_ppm(freqs, pseudocount=0.1)

Function to convert a valid PFM into a PPM. The matrix is normalized so that every position sums to 1, thus converting from a matrix of frequencies to a matrix of probabilities. To avoid zeros, a value of pseudocount/4 is first added to each position before normalization.

Parameters:
  • freqs (np.array) – array containing PFM values

  • pseudocount (float) – pseudocount to add

Returns:

Numpy array containing PPM.

smeagol.matrices.position_wise_ic(probs)

Function to calculate the information content (IC) of each position in a PPM. The IC for a specific position is calculated using the formula: IC = 2 + sum(p*log2(p)), assuming a background probability of 0.25 for each of the 4 bases.

Parameters:

probs (np.array) – Numpy array containing PPM probability values.

Returns:

Numpy array containing the calculated

information content of each column in probs.

Return type:

result (np.array)

smeagol.matrices.ppm_to_pwm(probs)

Function to convert a valid PPM into a PWM, using the formula: PWM = log2(PPM/B), where the background probability B is set to 0.25.

Parameters:

probs (np.array) – Numpy array containing PPM probability values

Returns:

Numpy array containing the PWM. The shape of this array is (L, 4) where L is the PWM length.

smeagol.matrices.pwm_to_ppm(weights)
smeagol.matrices.trim_ppm(probs, frac_threshold)

Function to trim non-informative positions from the ends of a PPM. See position_wise_ic for how information content of each position is calculated.

Parameters:
  • probs (np.array) – Numpy array containing PPM probability values

  • frac_threshold (float) – threshold between 0 and 1. The mean information content (IC) of all columns in the matrix will be calculated and any continuous columns in the beginning and end of the matrix that have IC < frac_threshold * mean IC will be dropped.

Returns:

Numpy array containing trimmed PPM.

Return type:

result (np.array)

smeagol.models module

smeagol.scan module

smeagol.scan.count_in_sliding_windows(sites, genome, matrix_id, width, shift=None)

Function to count binding sites in sliding windows across a genome.

Parameters:
  • sites (pd.DataFrame) – dataframe containing locations of binding sites

  • genome (list) – list of SeqRecord objects

  • matrix_id (str) – selected PWM

  • width (int) – width of tiling windows

  • shift (int) – shift between successive windows

Returns:

dataframe containing number of

binding sites per window

Return type:

results (pd.DataFrame)

smeagol.scan.scan_sequences(seqs, model, threshold, sense='+', rcomp='none', outputs=['sites'], score=False, group_by=None, combine_seqs=False, sep_ids=False, seq_batch=0)

Encode given sequences and predict binding sites on them.

Parameters:
  • seqs (list / str) – list of seqrecord objects or fasta file

  • model (model) – class PWMModel

  • threshold (float or np.arange) – threshold (from 0 to 1) to identify binding sites OR np.arange (with binned_counts=True).

  • sense (str) – sense of sequence(s), ‘+’ or ‘-‘.

  • rcomp (str) – ‘only’ to encode the sequence reverse complement, ‘both’ to encode the reverse complement as well as original sequence, or ‘none’

  • outputs (list) – List containing the desired outputs - any combination of ‘sites’, ‘counts’, ‘binned_counts’ and ‘stats’. For example: [‘sites’, ‘counts’]. ‘sites’ outputs binding site locations; ‘counts’ outputs the total count of binding sites per PWM; ‘binned_counts’ outputs the counts of binding sites per PWM binned by score; ‘stats’ outputs the mean and standard deviation of the number of binding sites per PWM across multiple sequences.

  • score (bool) – output scores for binding sites. Only relevant if ‘sites’ is specified.

  • group_by (str) – key by which to group sequences. If None, each sequence will be a separate group.

  • combine_seqs (bool) – combine outputs for multiple sequence groups into a single dataframe

  • sep_ids (bool) – separate outputs by sequence ID.

  • seq_batch (int) – number of sequences to scan at a time. If 0, scan all.

Returns:

dictionary containing specified outputs.

Return type:

output (dict)

smeagol.scan.score_site(pwm, seq, position_wise=False)

Function to score a binding site sequence using a PWM.

Parameters:
  • pwm (np.array) – Numpy array containing the PWM weights

  • seq (str) – A sequence the same length as the PWM.

Returns:

PWM match score

Return type:

score (float)

smeagol.utils module

smeagol.utils.get_Seq(seq)

Function to convert a sequence into a Seq object.

Parameters:

seq (str, Seq or SeqRecord) – sequence

Returns:

Seq object

smeagol.utils.get_site_seq(site, seqs)

Function to extract the sequence for a binding site.

Parameters:
  • site – dataframe containing name, start and end.

  • seqs (list / str) – list of seqrecord objects or fasta file.

Returns:

sequence of binding site

Return type:

result (str)

smeagol.utils.get_tiling_windows_over_genome(genome, width, shift=None)

Function to get tiling windows over a genome.

Parameters:
  • genome (list) – list of SeqRecord objects

  • width (int) – width of tiling windows

  • shift (int) – shift between successive windows. By default the same as width.

Returns:

windows covering all sequences in genome

Return type:

windows (pd.DataFrame)

smeagol.utils.read_bg_seqs(file, records, simN)

Function to read background sequences from FASTA file.

Parameters:
  • file (str) – Path to fasta (or fasta.gz) file

  • records (list) – Original un-shuffled records

  • simN (int) – Number of shuffles

Returns:

list of seqrecord objects

smeagol.utils.shuffle_records(records, simN, simK=2, out_file=None, seed=1)

smeagol.variant module

smeagol.variant.variant_effect_on_sites(sites, variants, seqs, pwms)

Function to predict variant effect on sites.

Parameters:
  • sites (pd.DataFrame) – dataframe containing information on each binding site to analyze. Contains columns ‘name’ (sequence name), ‘matrix_id’ (ID of the PWM that matched to the site), ‘start’ (site position) and ‘end’ (site end).

  • variants (pd.DataFrame) – dataframe containing information on each variant to analyze. Contains columns ‘name’ (sequence name), ‘pos’ (variant position), and ‘alt’ (alternate base at variant position).

  • seqs (list / str) – list of seqrecord objects or fasta file.

  • pwms (pd.DataFrame) – dataframe containing information on PWMs. Contains columns ‘matrix_id’ and ‘weights’.

Returns:

dataframe containing information on

variant effects.

Return type:

effects (pd.DataFrame)

smeagol.visualize module

smeagol.visualize.plot_background(enrichment_result, Matrix_ids, sense, figsize=(17, 8), ncols=4, file_path=None)

Function to plot the distribution of background counts.

Parameters:
  • enrichment_result (dict) – Dictionary with result from enrich_in_genome

  • Matrix_ids (list) – IDs of PWMs to plot

  • sense (str) – ‘+’ or ‘-’

  • figsize (tuple) – total figure size

  • ncols (int) – number of columns for figure panels

  • file_path (str) – path to save figure.

Returns:

Plot of motif distribution in real vs. background sequences.

smeagol.visualize.plot_binned_count_dist(real_preds, Matrix_id, sense, shuf_preds=None, rounding=3, file_path=None)

Function to plot distribution of (fractional) binding site scores in real vs. shuffled genomes

Parameters:
  • real_preds (pd.DataFrame) – DF containing Matrix_id, sense, bin

  • shuf_preds (pd.DataFrame) – DF containing Matrix_id, sense, bin

  • Matrix_id (str) – ID of PWM for which to plot distribution

  • sense (str) – sense of strand for which to plot distribution. If not given, both strands are used.

  • file_path (str) – path to save figure

Returns:

Plot of binding site counts binned by score.

smeagol.visualize.plot_clustermap(foldchanges, pvalues, threshold=0.05, row_cluster=True, dendogram=True, file_path=None)

Simple function to plot clustermap of foldchanges and pwms (without annotation); function will filter foldchanges whenever pvalues are significant.

Parameters:
  • foldchanges (np.array) – matrix of foldchanges for enrichment and depletion of pwms

  • pvalues (np.array) – matrix of pvalues for enrichment of the same pwms

  • threshold (float) – threshold for pvalues. default 0.05

  • row_cluster (bool) – Cluster rows. Default True

  • dendogram (bool) – Plot dendograms. Default True

  • file_path (str) – The path to the file into which the plot should be written.

Returns:

heatmap plot.

smeagol.visualize.plot_clustermap_annot(foldchanges, pvalues, annot, column, threshold=0.05, group_by=None, drop_zeros=False, plot=None, min_occ=3, cluster=True, col_cluster=True, row_cluster=True, dendogram=True, col_dendogram=True, row_dendogram=True, file_path=None)

Function to plot clustermap of foldchanges and pwms, with annotation and more parameters

Parameters:
  • foldchanges (np.array) – matrix of foldchanges for enrichment and depletion of pwms

  • pvalues (np.array) – matrix of pvalues for enrichment of the same pwms

  • annot (pd.DataFrame) – annotation table with information on each column of pwm matrix

  • column (str) – which column contains the name to be used as labels on the plot

  • threshold (float) – threshold for pvalues. default 0.05

  • group_by (str) – extra column in annot table that groups the occurrences (ex. Genus, Family, Kingdom, Host, etc) (Default: None). If no group_by function was provided but column parameter is “Species”, we try to cluster species by “Genus” using the first word of Species.

  • drop_zeros (bool) – if True, drops all columns and rows in which all values are zero (Default: False)

  • plot (str) – ‘sep’ if you wish to separate the plots according to the groups provided in group_by (Default: None)

  • min_occ (int) – min number of occurrences within a group (defined by group_by) to plot separately if plot=”sep”

  • cluster (bool) – Clusters both columns and rows (Default: True)

  • col_cluster (bool) – Cluster columns (Default: True)

  • row_cluster (bool) – Cluster rows (Default: True)

  • dendogram (bool) – Plot both row and col dendograms (Default: True)

  • row_dendogram (bool) – Plot row dendograms (Default: True)

  • col_dendogram (bool) – Plot column dendograms (Default: True)

  • file_path (str) – The path to the file into which the plot should be written.

Returns:

annotated heatmap plot.

smeagol.visualize.plot_ppm(ppm_df, Matrix_id, height=15)

Function to plot sequence logo from PPM

Parameters:
  • ppm_df (pd.DataFrame) – Dataframe containing cols probs, Matrix_id

  • Matrix_id – ID of PWM to plot

Returns:

Plots PPM

smeagol.visualize.plot_pwm(pwm_df, Matrix_id, figsize=(5, 2))

Function to plot sequence logo from PWM

Parameters:
  • pwm_df (pd.DataFrame) – Dataframe containing cols weight, Matrix_id

  • Matrix_id – ID of PWM to plot (will be used as logo title)

  • figsize (tuple) – (width, height)

Returns:

Plots PWM

Function to visualize the sequence logo of a PPM.

Parameters:
  • probs (np.array) – array containing probabilities

  • title (str) – Title of the logo.

  • figsize (tuple) – (width, height)

Returns:

PPM logo

Function to visualize the sequence logo of a PWM.

Parameters:
  • weights (np.array) – array containing PWM weight values

  • title (str) – Title of the logo.

  • figsize (tuple) – (width, height)

Returns:

PWM logo

smeagol.visualize.sliding_window_count_plot(df, title, cols=3, aspect=1.4, file_path=None)

Function to visualize site counts using sliding windows.

Parameters:
  • df (pd.DataFrame) – A pandas dataframe that contains the data to plot.

  • title (str) – The title to be used for the plot.

  • cols (int) – number of columns for plot

  • aspect (float) – width/height

  • file_path (str) – The path to the file into which the plot should be written.

Returns:

Sliding window plot.

smeagol.visualize.sliding_window_enrichment_plot(sliding_window_df, x_var, y_var, xticklabels, title, fig_size=(12, 5), file_path=None)

Function to visualize enrichment / depletion analysis using sliding windows.

Parameters:
  • sliding_window_df (pd.DataFrame) – A pandas dataframe that contains the data to plot.

  • x_var (str) – The name of the column that should be used as x-axis.

  • y_var (str) – The name of the column that should be used as y-axis.

  • xticklabels (str) – The name of the column that contains the x-axis labels.

  • title (str) – The title to be used for the plot.

  • file_path (str) – The path to the file into which the plot should be written.

Returns:

Sliding window plot.

Module contents