smeagol package¶

Submodules¶

smeagol.aggregate module¶

smeagol.aggregate.collect_data(file_pattern, file_ids, col_with_index, col_to_collect)¶

This function is used to aggregate multiple SMEAGOL result files, for example on different genomes. It collects a specific column (col_to_collect) from all the result files and merges these columns into a single dataframe. The merge is based on a specific index column (col_with_index) that should be present in all files.

Parameters:

file_pattern (str) – pattern to search for in filenames
file_ids (list) – list of IDs, one per file
col_with_index (str) – Name of the column to set as index
col_to_collect (str) – Name of the column to collect

Returns:

A concatenated dataframe containing the: col_to_collect column from all files matching the pattern file_pattern.

Return type:

df_all (pandas DataFrame)

smeagol.encode module¶

class smeagol.encode.SeqEncoding(records, rcomp='none', sense=None)¶

Bases: object

This class encodes a single nucleic acid sequence, or a set of sequences, all of which must have the same length. Genomic sequences used to initialize the class must also have the same sense.

len¶

Length of the sequences

Type:: int

ids¶

Numpy array containing the IDs of all sequences.

Type:: np.array

names¶

Numpy array containing the names of all sequences.

Type:: np.array

seqs¶

Numpy array containing the integer encoded sequences.

Type:: np.array

senses¶

Numpy array containing the senses (‘+’ or ‘-’) of all sequences.

Type:: np.array

class smeagol.encode.SeqGroups(records, rcomp='none', sense=None, group_by=None)¶

Bases: object

This class encodes one or more groups of equal-length nucleic acid sequences. Sequences in different groups may have different lengths.

smeagol.encode.integer_encode(seq, rc=False)¶

Function to encode a nucleic acid sequence as a sequence of integers.

Parameters:

seq (str, Seq or SeqRecord object) – Nucleic acid sequence to encode.
A (Allowed characters are) –
C –
G –
T –
U –
Z –
N –
W –
S –
M –
K –
R –
Y –
B –
D –
H –
V. –
rc (bool) – If True, reverse complement the sequence before encoding

Returns:

Numpy array containing the integer encoded sequence. The shape is (L, ) where L is the length of the sequence.

Return type:

result (np.array)

smeagol.encode.one_hot_encode(seq, rc=False)¶

Function to one-hot encode a nucleic acid sequence.

Parameters:

seq (str, Seq or SeqRecord object) – Nucleic acid sequence to encode.
A (Allowed characters are) –
C –
G –
T –
U –
Z –
N –
W –
S –
M –
K –
R –
Y –
B –
D –
H –
V. –
rc (bool) – If True, reverse complement the sequence before encoding

Returns:

Numpy array containing the one-hot encoded sequence. The shape is (L, 4) where L is the length of the sequence.

Return type:

result (np.array)

smeagol.enrich module¶

smeagol.enrich.enrich_in_genome(records, model, rcomp, threshold, simN=1000, simK=2, sense='+', background='binomial', verbose=False, combine_seqs=True, seq_batch=0, shuf_seqs=None)¶

Function to calculate enrichment of motifs in given sequence(s) relative to a background. The background sequences can be supplied directly or generated by shuffling the given sequence(s).

Parameters:

records (list) – A list of seqrecord objects containing the sequence(s) to test. This list can be generated by reading a fasta file using ‘smeagol.io.read_fasta’.
model (PWMModel) – An object of class PWMModel for PWM scanning
rcomp (str) – ‘only’ to encode the reverse complements of the given sequences, ‘both’ to encode the reverse complements as well as the given sequences, or ‘none’ to use only the given sequences without reverse complementing.
threshold (float) – Threshold motif match score to use for PWM scanning. A sub-sequence that has a PWM match score higher than this will be counted as a match.
simN (int) – The number of shuffled sequences to generate for the background model.
simK (int) – The k-mer frequency to conserve while shuffling
sense (str) – Sense of the given sequences. Must be ‘+’ or ‘-’
background (str) – Background model for calculating p-values. Must be ‘binomial’ or ‘normal’
combine_seqs (bool) – If true, combine outputs for all supplied sequences into a single output dataframe
seq_batch (int) – The number of shuffled sequences to scan at a time. If 0, scan all at once.
shuf_seqs (str/list) – A path to FASTA file or list of seqrecord objects containing shuffled sequences. Use if loading the background sequences directly instead of generating them by shuffling.

Returns:

dictionary containing results. Contains the following: keys: ‘enrichment’ (enrichment statistics), ‘real_sites’ (locations of motif matches on the given sequences), ‘real_counts’ (number of sites for each PWM on the given sequences), ‘shuf_counts’ (number of sites for each PWM on the background sequences), ‘shuf_stats’ (mean and standard deviation of the number of sites for each PWM on the background sequences).

Return type:

results (dict)

smeagol.enrich.enrich_in_sliding_windows(sites, genome, width, shift=None, matrix_id=None, alternative='two-sided')¶

Function to test for enrichment of motifs in sliding windows across a genome.

Parameters:

sites (pd.DataFrame) – A pandas dataframe containing the locations of motifs in the genome. Should contain columns id (the sequence ID), start and end.
genome (list) – A list of SeqRecord objects.
width (int) – The width (in bases) of tiling windows to test.
shift (int) – The shift (in bases) between successive windows. If not supplied, non-overlapping windows are generated.
matrix_id (str) – Optional. The ID of the PWM to test. If supplied, it should correspond to an entry in the column Matrix_id of sites.
alternative (str) – Defines the alternative hypothesis. Options are ‘two-sided’, ‘less’ or ‘greater’. Default is ‘two-sided’.

Returns:

A Pandas DataFrame containing the results of enrichment analysis in windows covering all sequences in genome

smeagol.enrich.enrich_in_window(window, sites, genome, matrix_id=None, alternative='two-sided')¶

Function to test for enrichment of motifs in a subsequence relative to a complete sequence. This function uses Fisher’s exact test.

Parameters:

window (pd.DataFrame) – A Pandas dataframe with the following columns: id (sequence ID), start, end
sites (pd.DataFrame) – A pandas dataframe containing the locations of motifs in the genome. Should contain columns id (sequence ID, start, end.
genome (list) – A list of SeqRecord objects
matrix_id (str) – Optional. The ID of the selected PWM. If supplied, it must correspond to an entry in the column ‘Matrix_id’ of sites.
alternative (str) – Defines the alternative hypothesis. Options are ‘two-sided’, ‘less’ or ‘greater’. Default is ‘two-sided’.

Returns:

Pandas dataframe containing the result of: enrichment analysis in selected window.

Return type:

result (pd.DataFrame)

smeagol.enrich.examine_thresholds(records, model, simN=300, simK=2, rcomp='only', min_threshold=0.5, threshold_shift=0.1, sense='+', verbose=False, combine_seqs=True)¶

Function to scan given sequence(s) with PWMs and calculate the number of binding sites predicted using different cutoffs for the match score, for the real sequence as well as shuffled sequences.

Parameters:

records (list) – A list of seqrecord objects containing the sequence(s) to scan.
model (PWMModel) – An object of class PWMModel for PWM scanning
simN (int) – The number of shuffled sequences to generate for the background model. Default is 300.
simK (int) – The k-mer frequency to conserve while shuffling. Default is 2.
rcomp (str) – ‘only’ to encode the sequence reverse complements, ‘both’ to encode the reverse complements as well as original sequences, or ‘none’. Default is ‘only’.
min_threshold (float) – Minimum threshold motif match score to use for PWM scanning. Default is 0.5.
threshold_shift (float) – Difference between successive thresholds to compare. Default is 0.1.
sense (str) – Sense of the given sequences. Must be ‘+’ or ‘-’. Default is ‘+’.
verbose (bool) – If true, output all information
combine_seqs (bool) – If true, combine outputs for all supplied sequences into a single output dataframe

Returns:

dictionary containing results. Contains the following: keys: ‘real_binned’ (Number of motif matches for each PWM in the given sequences, binned by their match score), ‘shuf_binned’ (Number of motif matches for each PWM in the background sequences, binned by their match score)

Return type:

results (dict)

smeagol.io module¶

smeagol.io.load_attract()¶

Function to download all motifs and metadata from ATtRACT and load them as a pandas DF.

Returns:: contains matrices
Return type:: df (pandas df)

smeagol.io.load_rbpdb(species='H.sapiens', matrix_type='PWM')¶

Function to download all motifs and metadata from RBPDB and load them as a pandas DF. Downloads version 1.3.1.

Parameters:

species (str) – ‘H.sapiens’, ‘M.musculus’, ‘D.melanogaster’ or ‘C.elegans’.
matrix_type (str) – ‘PWM’ or ‘PFM’

Returns:

contains matrices

Return type:

df (pandas df)

smeagol.io.load_smeagol_PWMset(dataset='representative')¶

Function to download the curated set of motifs used in the SMEAGOL paper and load them as a pandas dataframe.

Parameters:: dataset – ‘full’ or ‘representative’.
Returns:: contains matrices
Return type:: df (pandas df)

smeagol.io.read_fasta(file)¶

Function to read sequences from a fasta or fasta.gz file.

Parameters:: file (str) – path to file
Returns:: list of seqRecord objects, one for each sequence in the fasta file.
Return type:: records (list)

smeagol.io.read_pms_from_dir(dirname, matrix_type='PPM', transpose=False)¶

Function to read position matrices from a directory with separate files for each PM.

The input directory is expected to contain individual files each of which represents a separate matrix.

The file name should be the name of the PWM followed by an extension. If _ is included in the name, the characters after the _ will be dropped.

Matrices are by default expected to be in the position x base format, i.e. one row per position and one column per base (A, C, G, T). If the matrices are instead in the base x position format, set transpose=True.

Parameters:

dirname (str) – path to the folder containing PMs in individual files
matrix_type (str) – ‘PWM’, ‘PPM’ (default) or ‘PFM’
transpose (bool) – If true, transpose the matrix

Returns:

pandas dataframe containing PMs

smeagol.io.read_pms_from_file(file, matrix_type='PPM', check_lens=False, transpose=False, delimiter='\t')¶

Function to read position matrices from a FASTA-like file.

The input file is expected to follow the format:

>Matrix_1_ID matrix values >Matrix_2_ID matrix values

Matrices are by default expected to be in the position x base format, i.e. one row per position and one column per base (A, C, G, T). If the matrices are instead in the base x position format, set transpose=True.

Optionally, the ID rows may also contain a second field indicating the length of the matrix, for example:

>Matrix_1_ID<tab>7

The pwm.txt file downloaded from the Attract database follows this format. In this case, you can set check_lens=True to confirm that the loaded PWMs match the expected lengths.

Parameters:

pm_file (str) – path to the file containing PMs
matrix_type (str) – ‘PWM’, ‘PPM’ (default) or ‘PFM’
check_lens (bool) – check that the matrix lengths match the lengths provided in the file
transpose (bool) – transpose the matrices
delimiter (str) – The string used to separate values in the file

Returns:

pandas dataframe containing PMs

smeagol.io.write_fasta(records, file, gz=True)¶

Function to write sequences to a fasta or fasta.gz file.

Params:: records (list): list of seqRecord objects to write. file (str): path to file gz (bool): If true, compress the output file with gzip.

Returns:: Writes records to the file

smeagol.matrices module¶

smeagol.matrices.align_pms(X, Y, min_overlap=None)¶

Function to generate all possible overlaps between two position matrices.

Inputs:: X, Y (np.array): two position matrices. min_overlap (int): minimum overlap allowed between matrices

Returns:

tuples containing matrix start and end positions: corresponding to each possible alignment of the two matrices.

Return type:

result

smeagol.matrices.check_pfm(freqs, warn=False)¶

Function to check the validity of a PFM. A valid PFM should have non-negative integer values and exactly 4 columns.

Parameters:: freqs (np.array) – Numpy array containing frequency values. The array should be 2-dimensional and have 4 columns.
Raises:: ValueError – if freqs is an invalid PFM.

smeagol.matrices.check_ppm(probs, warn=False, eps=0.01)¶

Function to check the validity of a PPM. A valid PPM should have all values within the range [0,1], all rows summing to 1 (within a tolerance of eps), and exactly 4 columns.

Parameters:: probs (np.array) – Numpy array containing probability values. The array should be 2-dimensional and have 4 columns.
Raises:: ValueError – if probs is an invalid PPM.

smeagol.matrices.check_pwm(weights)¶

Function to check the validity of a PWM. A valid PWM should have exactly 4 columns.

Parameters:: weights (np.array) – Numpy array containing PWM weights. The array should be 2-dimensional and have 4 columns.
Raises:: ValueError – if weights is an invalid PWM.

smeagol.matrices.choose_cluster_representative_pms(df, sims=None, clusters=None, maximize='median', weight_col='weights', matrix_type='PWM')¶

Function to choose a representative position matrix from each cluster. This performs choose_representative_pm to choose a representative matrix within each supplied cluster.

Parameters:

df (pandas df) – Dataframe containing position matrix values and IDs.
sims (np.array) – pairwise similarities between all PWMs in pwms
clusters (list) – cluster assignments for each PWM.
maximize (str) – ‘mean’ or ‘median’. Metric to choose representative matrix.
weight_col (str) – column in pwms that contains matrix values.
matrix_type (str) – PPM or PWM

Returns:

IDs for the selected representative matrices.

Return type:

representatives (list)

smeagol.matrices.choose_representative_pm(df, sims=None, maximize='median', weight_col='weights', matrix_type='PWM')¶

Function to choose a representative position matrix from a group. If 2 matrices are supplied, the one with lower entropy is chosen. For larger sets of matrices, the matrix with the highest mean (or median) normalized Pearson correlation to all other matrices is chosen.

Parameters:

df (pandas df) – Dataframe containing position matrix values and IDs. The ID column should be named ‘Matrix_id’.
sims (np.array) – pairwise similarities between all PWMs in df
maximize (str) – ‘mean’ or ‘median’. Metric to choose representative matrix.
weight_col (str) – the column in df that contains matrix values.
matrix_type (str) – PPM or PWM

Returns:

IDs for the selected representative matrices.

Return type:

result (list)

smeagol.matrices.cluster_pms(df, n_clusters=None, similarity=None, sims=None, weight_col='weights')¶

Function to cluster position matrices. A distance matrix between the matrices is computed using the normalized Pearson correlation metric and agglomerative clustering is used to find clusters. choose_representative_pm is called to identify a representative matrix from each cluster. Set either number of clusters or similarity threshold.

Parameters:

df (pandas df) – Dataframe containing position matrix values and IDs.
n_clusters (int) – Number of clusters
similarity (float) – minumum within-cluster similarity threshold [0-1]
sims (np.array) – pairwise similarities between all matrices in pwms
weight_col (str) – column in pwms that contains matrix values.

Returns:

dictionary containing cluster labels, representative: matrix IDs, and minimum pairwise similarity within each cluster

Return type:

result

smeagol.matrices.cos_sim(a, b)¶

Function to calculate the cosine similarity between two vectors.

Parameters:

a (np.array) – 1-D arrays.
b (np.array) – 1-D arrays.

Returns:

cosine similarity between a and b.

Return type:

result (float)

smeagol.matrices.entropy(probs)¶

Function to calculate the entropy of a PPM or column of a PPM. Entropy is calculated using the formula H = -sum(p*log2(p)).

Parameters:: probs (np.array) – Numpy array containing probability values.
Returns:: Entropy value
Return type:: result (float)

smeagol.matrices.matrix_correlation(X, Y)¶

Function to calculate the correlation between two equal-sized matrices. The values in each matrix are unrolled into a 1-D array and the Pearson correlation of the two 1-D arrays is returned.

Parameters:

X (np.array) – two numpy arrays with same shape
Y (np.array) – two numpy arrays with same shape

Returns:

Pearson correlation between X and Y: (unrolled into 1-D arrays)

Return type:

corr (float)

Raises:

ValueError – if X and Y do not have equal shapes.

smeagol.matrices.ncorr(X, Y, min_overlap=None)¶

Function to calculate the normalized Pearson correlation between two position matrices, as defined in doi: 10.1093/nar/gkx314.

Inputs:: X, Y (np.array): two position matrices. min_overlap (int): minimum overlap allowed between matrices

Returns:: normalized Pearson correlation value
Return type:: result (float)

smeagol.matrices.normalize_pm(pm)¶

Function to normalize a position matrix so that all rows sum to 1.

Parameters:: pm (np.array) – Numpy array containing probability values
Returns:: Numpy array with all rows normalized to sum to 1.

smeagol.matrices.order_pms(X, Y)¶

Function to order two matrices by width.

Inputs:: X, Y (np.array): two position matrices.

Returns:: two position matrices ordered by width, X being smaller.
Return type:: X, Y (np.array)

smeagol.matrices.pairwise_ncorrs(mats)¶

Function to calculate all pairwise normalized Pearson correlations between a list of position matrices.

Parameters:: mats (list) – a list of position matrices.
Returns:: All pairwise normalized Pearson correlations between matrices in mats.
Return type:: ncorrs (np.array)

smeagol.matrices.pfm_to_ppm(freqs, pseudocount=0.1)¶

Function to convert a valid PFM into a PPM. The matrix is normalized so that every position sums to 1, thus converting from a matrix of frequencies to a matrix of probabilities. To avoid zeros, a value of pseudocount/4 is first added to each position before normalization.

Parameters:

freqs (np.array) – array containing PFM values
pseudocount (float) – pseudocount to add

Returns:

Numpy array containing PPM.

smeagol.matrices.position_wise_ic(probs)¶

Function to calculate the information content (IC) of each position in a PPM. The IC for a specific position is calculated using the formula: IC = 2 + sum(p*log2(p)), assuming a background probability of 0.25 for each of the 4 bases.

Parameters:

probs (np.array) – Numpy array containing PPM probability values.

Returns:

Numpy array containing the calculated: information content of each column in probs.

Return type:

result (np.array)

smeagol.matrices.ppm_to_pwm(probs)¶

Function to convert a valid PPM into a PWM, using the formula: PWM = log2(PPM/B), where the background probability B is set to 0.25.

Parameters:: probs (np.array) – Numpy array containing PPM probability values
Returns:: Numpy array containing the PWM. The shape of this array is (L, 4) where L is the PWM length.

smeagol.matrices.pwm_to_ppm(weights)¶

smeagol.matrices.reverse_complement(mat)¶

Function to get the reverse complement of a matrix. The reverse complement is calculated by reversing the matrix and swapping the columns.

Parameters:

mat (np.array) – Numpy array containing motif values.
[A (Expected column order is) –
C –
G –
T]. –

Returns:

Numpy array containing the reverse: complement of the input matrix.

Return type:

result (np.array)

smeagol.matrices.trim_ppm(probs, frac_threshold)¶

Function to trim non-informative positions from the ends of a PPM. See position_wise_ic for how information content of each position is calculated.

Parameters:

probs (np.array) – Numpy array containing PPM probability values
frac_threshold (float) – threshold between 0 and 1. The mean information content (IC) of all columns in the matrix will be calculated and any continuous columns in the beginning and end of the matrix that have IC < frac_threshold * mean IC will be dropped.

Returns:

Numpy array containing trimmed PPM.

Return type:

result (np.array)

smeagol.models module¶

class smeagol.models.PWMModel(pwm_df)¶

Bases: object

Class to contain a convolutional model used for PWM scanning.

Parameters:

Matrix_ids (np.array) – Numpy array containing the IDs of all the PWMs encoded in the model.
widths (np.array) – Numpy array containing the widths of all the PWMs encoded in the model.
max_scores (np.array) – Numpy array containing the maximum possible score for each PWM encoded in the model.
channels (int) – Number of PWMs encoded in the model.
max_width (int) – Width of the longest PWM encoded in the model.

predict(seqs)¶

Scan encoded sequences with the model.

Parameters:: seqs (np.array) – Numpy array containing integer encoded sequences.

predict_with_threshold(seqs, frac_threshold, score=False, seq_batch=0)¶

Scan encoded sequences in batches with the model and threshold the returned values.

Parameters:

seqs (list or np.array) – strings or integer encoded sequences
frac_threshold (float) – fraction of maximum score to use as binding site detection threshold
score (bool) – Output binding site scores as well as positions
seq_batch (int) – number of sequences to scan at a time. If 0, scan all.

Returns:

positions where the input sequence(s) match: the input PWM(s) with a score above the specified threshold.

scores (np.array): score for each potential binding site

Return type:

thresholded (np.array)

set_model_weights(weights)¶

Function to fix the weights of the conv model using PWM weights.

Parameters:: weights (np.array) – Numpy array containing PWM weights.

smeagol.scan module¶

smeagol.scan.count_in_sliding_windows(sites, genome, matrix_id, width, shift=None)¶

Function to count binding sites in sliding windows across a genome.

Parameters:

sites (pd.DataFrame) – dataframe containing locations of binding sites
genome (list) – list of SeqRecord objects
matrix_id (str) – selected PWM
width (int) – width of tiling windows
shift (int) – shift between successive windows

Returns:

dataframe containing number of: binding sites per window

Return type:

results (pd.DataFrame)

smeagol.scan.scan_sequences(seqs, model, threshold, sense='+', rcomp='none', outputs=['sites'], score=False, group_by=None, combine_seqs=False, sep_ids=False, seq_batch=0)¶

Encode given sequences and predict binding sites on them.

Parameters:

seqs (list / str) – list of seqrecord objects or fasta file
model (model) – class PWMModel
threshold (float or np.arange) – threshold (from 0 to 1) to identify binding sites OR np.arange (with binned_counts=True).
sense (str) – sense of sequence(s), ‘+’ or ‘-‘.
rcomp (str) – ‘only’ to encode the sequence reverse complement, ‘both’ to encode the reverse complement as well as original sequence, or ‘none’
outputs (list) – List containing the desired outputs - any combination of ‘sites’, ‘counts’, ‘binned_counts’ and ‘stats’. For example: [‘sites’, ‘counts’]. ‘sites’ outputs binding site locations; ‘counts’ outputs the total count of binding sites per PWM; ‘binned_counts’ outputs the counts of binding sites per PWM binned by score; ‘stats’ outputs the mean and standard deviation of the number of binding sites per PWM across multiple sequences.
score (bool) – output scores for binding sites. Only relevant if ‘sites’ is specified.
group_by (str) – key by which to group sequences. If None, each sequence will be a separate group.
combine_seqs (bool) – combine outputs for multiple sequence groups into a single dataframe
sep_ids (bool) – separate outputs by sequence ID.
seq_batch (int) – number of sequences to scan at a time. If 0, scan all.

Returns:

dictionary containing specified outputs.

Return type:

output (dict)

smeagol.scan.score_site(pwm, seq, position_wise=False)¶

Function to score a binding site sequence using a PWM.

Parameters:

pwm (np.array) – Numpy array containing the PWM weights
seq (str) – A sequence the same length as the PWM.

Returns:

PWM match score

Return type:

score (float)

smeagol.utils module¶

smeagol.utils.get_Seq(seq)¶

Function to convert a sequence into a Seq object.

Parameters:: seq (str, Seq or SeqRecord) – sequence
Returns:: Seq object

smeagol.utils.get_site_seq(site, seqs)¶

Function to extract the sequence for a binding site.

Parameters:

site – dataframe containing name, start and end.
seqs (list / str) – list of seqrecord objects or fasta file.

Returns:

sequence of binding site

Return type:

result (str)

smeagol.utils.get_tiling_windows_over_genome(genome, width, shift=None)¶

Function to get tiling windows over a genome.

Parameters:

genome (list) – list of SeqRecord objects
width (int) – width of tiling windows
shift (int) – shift between successive windows. By default the same as width.

Returns:

windows covering all sequences in genome

Return type:

windows (pd.DataFrame)

smeagol.utils.read_bg_seqs(file, records, simN)¶

Function to read background sequences from FASTA file.

Parameters:

file (str) – Path to fasta (or fasta.gz) file
records (list) – Original un-shuffled records
simN (int) – Number of shuffles

Returns:

list of seqrecord objects

smeagol.utils.shuffle_records(records, simN, simK=2, out_file=None, seed=1)¶

Function to shuffle sequences. :param records: list of seqRecord objects :type records: list :param simN: Number of times to shuffle :type simN: int :param simK: k-mer frequency to conserve, either 1 or 2 :type simK: int :param out_file: Path to output file (optional) :type out_file: str :param seed: Random seed :type seed: int

Returns:: list of shuffled sequences Writes shuf_records to file if out_file provided.
Return type:: shuf_records (list)

smeagol.variant module¶

smeagol.variant.variant_effect_on_sites(sites, variants, seqs, pwms)¶

Function to predict variant effect on sites.

Parameters:

sites (pd.DataFrame) – dataframe containing information on each binding site to analyze. Contains columns ‘name’ (sequence name), ‘matrix_id’ (ID of the PWM that matched to the site), ‘start’ (site position) and ‘end’ (site end).
variants (pd.DataFrame) – dataframe containing information on each variant to analyze. Contains columns ‘name’ (sequence name), ‘pos’ (variant position), and ‘alt’ (alternate base at variant position).
seqs (list / str) – list of seqrecord objects or fasta file.
pwms (pd.DataFrame) – dataframe containing information on PWMs. Contains columns ‘matrix_id’ and ‘weights’.

Returns:

dataframe containing information on: variant effects.

Return type:

effects (pd.DataFrame)

smeagol.visualize module¶

smeagol.visualize.plot_background(enrichment_result, Matrix_ids, sense, figsize=(17, 8), ncols=4, file_path=None)¶

Function to plot the distribution of background counts.

Parameters:

enrichment_result (dict) – Dictionary with result from enrich_in_genome
Matrix_ids (list) – IDs of PWMs to plot
sense (str) – ‘+’ or ‘-’
figsize (tuple) – total figure size
ncols (int) – number of columns for figure panels
file_path (str) – path to save figure.

Returns:

Plot of motif distribution in real vs. background sequences.

smeagol.visualize.plot_binned_count_dist(real_preds, Matrix_id, sense, shuf_preds=None, rounding=3, file_path=None)¶

Function to plot distribution of (fractional) binding site scores in real vs. shuffled genomes

Parameters:

real_preds (pd.DataFrame) – DF containing Matrix_id, sense, bin
shuf_preds (pd.DataFrame) – DF containing Matrix_id, sense, bin
Matrix_id (str) – ID of PWM for which to plot distribution
sense (str) – sense of strand for which to plot distribution. If not given, both strands are used.
file_path (str) – path to save figure

Returns:

Plot of binding site counts binned by score.

smeagol.visualize.plot_clustermap(foldchanges, pvalues, threshold=0.05, row_cluster=True, dendogram=True, file_path=None)¶

Simple function to plot clustermap of foldchanges and pwms (without annotation); function will filter foldchanges whenever pvalues are significant.

Parameters:

foldchanges (np.array) – matrix of foldchanges for enrichment and depletion of pwms
pvalues (np.array) – matrix of pvalues for enrichment of the same pwms
threshold (float) – threshold for pvalues. default 0.05
row_cluster (bool) – Cluster rows. Default True
dendogram (bool) – Plot dendograms. Default True
file_path (str) – The path to the file into which the plot should be written.

Returns:

heatmap plot.

smeagol.visualize.plot_clustermap_annot(foldchanges, pvalues, annot, column, threshold=0.05, group_by=None, drop_zeros=False, plot=None, min_occ=3, cluster=True, col_cluster=True, row_cluster=True, dendogram=True, col_dendogram=True, row_dendogram=True, file_path=None)¶

Function to plot clustermap of foldchanges and pwms, with annotation and more parameters

Parameters:

foldchanges (np.array) – matrix of foldchanges for enrichment and depletion of pwms
pvalues (np.array) – matrix of pvalues for enrichment of the same pwms
annot (pd.DataFrame) – annotation table with information on each column of pwm matrix
column (str) – which column contains the name to be used as labels on the plot
threshold (float) – threshold for pvalues. default 0.05
group_by (str) – extra column in annot table that groups the occurrences (ex. Genus, Family, Kingdom, Host, etc) (Default: None). If no group_by function was provided but column parameter is “Species”, we try to cluster species by “Genus” using the first word of Species.
drop_zeros (bool) – if True, drops all columns and rows in which all values are zero (Default: False)
plot (str) – ‘sep’ if you wish to separate the plots according to the groups provided in group_by (Default: None)
min_occ (int) – min number of occurrences within a group (defined by group_by) to plot separately if plot=”sep”
cluster (bool) – Clusters both columns and rows (Default: True)
col_cluster (bool) – Cluster columns (Default: True)
row_cluster (bool) – Cluster rows (Default: True)
dendogram (bool) – Plot both row and col dendograms (Default: True)
row_dendogram (bool) – Plot row dendograms (Default: True)
col_dendogram (bool) – Plot column dendograms (Default: True)
file_path (str) – The path to the file into which the plot should be written.

Returns:

annotated heatmap plot.

smeagol.visualize.plot_ppm(ppm_df, Matrix_id, figsize=(5, 2))¶

Function to plot sequence logo from PPM

Parameters:

ppm_df (pd.DataFrame) – Dataframe containing cols probs, Matrix_id
Matrix_id – ID of PWM to plot

Returns:

Plots PPM

smeagol.visualize.plot_pwm(pwm_df, Matrix_id, figsize=(5, 2))¶

Function to plot sequence logo from PWM

Parameters:

pwm_df (pd.DataFrame) – Dataframe containing cols weight, Matrix_id
Matrix_id – ID of PWM to plot (will be used as logo title)
figsize (tuple) – (width, height)

Returns:

Plots PWM

smeagol.visualize.ppm_logo(probs, title='', figsize=(5, 2))¶

Function to visualize the sequence logo of a PPM.

Parameters:

probs (np.array) – array containing probabilities
title (str) – Title of the logo.
figsize (tuple) – (width, height)

Returns:

PPM logo

smeagol.visualize.pwm_logo(weights, title='', figsize=(5, 2))¶

Function to visualize the sequence logo of a PWM.

Parameters:

weights (np.array) – array containing PWM weight values
title (str) – Title of the logo.
figsize (tuple) – (width, height)

Returns:

PWM logo

smeagol.visualize.sliding_window_count_plot(df, title, cols=3, aspect=1.4, file_path=None)¶

Function to visualize site counts using sliding windows.

Parameters:

df (pd.DataFrame) – A pandas dataframe that contains the data to plot.
title (str) – The title to be used for the plot.
cols (int) – number of columns for plot
aspect (float) – width/height
file_path (str) – The path to the file into which the plot should be written.

Returns:

Sliding window plot.

smeagol.visualize.sliding_window_enrichment_plot(sliding_window_df, x_var, y_var, xticklabels, title, fig_size=(12, 5), file_path=None)¶

Function to visualize enrichment / depletion analysis using sliding windows.

Parameters:

sliding_window_df (pd.DataFrame) – A pandas dataframe that contains the data to plot.
x_var (str) – The name of the column that should be used as x-axis.
y_var (str) – The name of the column that should be used as y-axis.
xticklabels (str) – The name of the column that contains the x-axis labels.
title (str) – The title to be used for the plot.
file_path (str) – The path to the file into which the plot should be written.

Returns:

Sliding window plot.

smeagol package¶

Submodules¶

smeagol.aggregate module¶

smeagol.encode module¶

smeagol.enrich module¶

smeagol.io module¶

smeagol.matrices module¶

smeagol.models module¶

smeagol.scan module¶

smeagol.utils module¶

smeagol.variant module¶

smeagol.visualize module¶

Module contents¶

SMEAGOL

Navigation

Related Topics