API

cisTarget

DEM

class pycistarget.motif_enrichment_dem.DEM(dem_db, region_sets: Dict[str, PyRanges], specie: str, subset_motifs: List[str] | None = None, contrasts: str | List | None = 'Other', name: str | None = 'DEM', max_bg_regions: int | None = None, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 1, mean_fg_thr: float | None = 0, motif_hit_thr: float | None = None, n_cpu: int | None = 1, fraction_overlap: float = 0.4, cluster_buster_path: str | None = None, path_to_genome_fasta: str | None = None, path_to_motifs: str | None = None, genome_annotation: PyRanges | None = None, promoter_space: int = 1000, path_to_motif_annotations: str | None = None, annotation_version: str = 'v9', motif_annotation: list = ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot'], motif_similarity_fdr: float = 0.001, orthologous_identity_threshold: float = 0.0, tmp_dir: int | None = None, **kwargs)[source]

DEM class. Runs the DEM method for motif enrichment analysis on sets of regions.

regions_to_db

A dataframe containing the mapping between query regions and regions in the database.

Type:

pd.DataFrame

region_sets

A dictionary of PyRanges containing region coordinates for the regions to be analyzed.

Type:

Dict

specie

Species from which the genomic coordinates come.

Type:

str

subset_motifs

List of motifs to disregard in the analysis. Default: None

Type:

List, optional

contrasts

List of contrasts to perform. Default: 'Other' (each group versus all the rest)

Type:

List, optional

name

Analysis name

Type:

str

max_bg_regions

Maximum number of regions to use as background. Default: None (All)

Type:

int, optional

adjpval_thr

Adjusted p-value threshold to consider a motif enriched. Default: 0.05

Type:

float, optional

log2fc_thr

Log2 Fold-change threshold to consider a motif enriched. Default: 1

Type:

float, optional

mean_fg_thr

Minimal mean signal in the foreground to consider a motif enriched. Default: 0

Type:

float, optional

motif_hit_thr

Minimal CRM score to consider a region enriched for a motif. Default: None (It will be automatically calculated based on precision-recall).

Type:

float, optional

n_cpu

Number of cores to use. Default: 1

Type:

int, optional

fraction_overlap

Minimal overlap between query and regions in the database for the mapping.

Type:

float, optional

cluster_buster_path

Path to cluster buster bin. Only required if using a shuffled background. Default: None

Type:

str, optional

path_to_genome_fasta

Path to genome fasta file. Only required if using a shuffled background. Default: None

Type:

str, optional.

path_to_motifs

Path to motif collection folder (in .cb format). Only required if using a shuffled background. Default: None

Type:

str, optional.

genome_annotation

Pyranges containing genome annotation (e.g. biomart). Only required if using promoter balance. Default: None

Type:

pr.PyRanges, optional.

promoter_space

Space around TSS to consider a region promoter. Only used if using promoter balance. Default: 1000

Type:

int, optional

path_to_motif_annotations

Path to motif annotations. If not provided, they will be downloaded from https://resources.aertslab.org based on the specie name provided (only possible for mus_musculus, homo_sapiens and drosophila_melanogaster). Default: None

Type:

str, optional

motif_similarity_fdr

Minimal motif similarity value to consider two motifs similar. Default: 0.001

Type:

float, optional

orthologous_identity_threshold

Minimal orthology value for considering two TFs orthologous. Default: 0.0

Type:

float, optional

motifs_to_use

A subset of motifs to use for the analysis. Default: None (All)

Type:

List, optional

tmp_dir

Temp directory to use if running cluster_buster. Default: None (/tmp)

Type:

str, optional

motif_enrichment

A dataframe containing motif enrichment results

Type:

pd.DataFrame

motif_hits

A dictionary containing regions that are considered enriched for each motif.

Type:

Dict

cistromes

A dictionary containing TF cistromes. Cistromes with no extension contain regions linked to directly annotated motifs, while ‘_extended’ cistromes can contain regions linked to motifs annotated by similarity or orthology.

Type:

Dict

Methods

DEM_results([name])

Print motif enrichment table as HTML

add_motif_annotation_dem([add_logo])

Add motif annotation

run(dem_db_scores, **kwargs)

Run DEM

DEM_results(name: str | None = None)[source]

Print motif enrichment table as HTML

add_motif_annotation_dem(add_logo: bool | None = True)[source]

Add motif annotation

Parameters:

add_logo (boolean, optional) – Whether to add the motif logo to the motif enrichment dataframe

run(dem_db_scores, **kwargs)[source]

Run DEM

Parameters:
  • dem_db_scores (pd.DataFrame) – A dataframe containing the maximum CRM score for each motif in each region.

  • **kwargs – Additional parameters to pass to ray.init()

class pycistarget.motif_enrichment_dem.DEMDatabase(fname: str, region_sets: Dict[str, PyRanges] | None = None, name: str | None = None, fraction_overlap: float = 0.4)[source]

DEM Database class. DEMDatabase contains a dataframe with motifs as rows, regions as columns and CRM scores as values. In addition, it contains a slot to map query regions to regions in the database. For more information on how to generate databases, please visit: https://github.com/aertslab/create_cisTarget_databases

regions_to_db

A dataframe containing the mapping between query regions and regions in the database.

Type:

pd.DataFrame

db_scores

A dataframe with motifs as rows, regions as columns and CRM scores as values.

Type:

pd.DataFrame

total_regions

Total number of regions in the database

Type:

int

Methods

load_db(fname[, region_sets, name, ...])

Load DEMDatabase

load_db(fname: str, region_sets: Dict[str, PyRanges] | None = None, name: str | None = None, fraction_overlap: float = 0.4)[source]

Load DEMDatabase

Parameters:
  • fname (str) – Path to feather file containing the DEM database (regions_vs_motifs)

  • region_sets (Dict or pr.PyRanges, optional) – Dictionary or pr.PyRanges that are going to be analyzed with DEM. Default: None.

  • name (str, optional) – Name for the DEM database. Default: None

  • fraction_overlap (float, optional) – Minimal overlap between query and regions in the database for the mapping.

pycistarget.motif_enrichment_dem.DEM_internal(dem_db_scores: DataFrame, region_group: List[List[str]], contrast_name: str, adjpval_thr: float | None = 0.05, log2fc_thr: float | None = 1, mean_fg_thr: float | None = 0, motif_hit_thr: float | None = None)[source]

Internal operations for DEM.
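
The three thresholds in this signature (adjpval_thr, log2fc_thr, mean_fg_thr) can be read as a combined filter on per-motif statistics. The following is a minimal sketch of that reading in plain Python, not the library's implementation: the dictionary keys, the pseudocount and the use of inclusive comparisons are all assumptions made for illustration.

```python
import math

def log2_fold_change(mean_fg, mean_bg, pseudocount=1e-6):
    """Log2 ratio of foreground to background mean score.

    The pseudocount is an assumption; it only avoids division by zero.
    """
    return math.log2((mean_fg + pseudocount) / (mean_bg + pseudocount))

def filter_enriched(stats, adjpval_thr=0.05, log2fc_thr=1.0, mean_fg_thr=0.0):
    """Keep motifs that pass all three thresholds.

    `stats` maps motif name -> dict with the hypothetical keys
    'adj_pval', 'mean_fg' and 'mean_bg'.
    """
    enriched = {}
    for motif, s in stats.items():
        lfc = log2_fold_change(s["mean_fg"], s["mean_bg"])
        passes = (
            s["adj_pval"] <= adjpval_thr
            and lfc >= log2fc_thr
            and s["mean_fg"] >= mean_fg_thr
        )
        if passes:
            enriched[motif] = lfc
    return enriched

example = {
    "motif_a": {"adj_pval": 0.001, "mean_fg": 4.0, "mean_bg": 1.0},
    "motif_b": {"adj_pval": 0.001, "mean_fg": 1.2, "mean_bg": 1.0},
    "motif_c": {"adj_pval": 0.5, "mean_fg": 8.0, "mean_bg": 1.0},
}
enriched = filter_enriched(example)
```

Here motif_b fails on fold change and motif_c fails on adjusted p-value, so only motif_a is kept.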

pycistarget.motif_enrichment_dem.create_groups(contrast: list, region_sets_names: list, max_bg_regions: int, path_to_genome_fasta: str, path_to_regions_fasta: str, cbust_path: str, path_to_motifs: str, annotation: PyRanges | None = None, promoter_space: int = 1000, motifs: list | None = None, n_cpu: int = 1, **kwargs)[source]

Format contrast groups

pycistarget.motif_enrichment_dem.get_motif_hits(scores, regions, labels, optimal_threshold=None)[source]

Determine optimal score threshold based on precision-recall.
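
To make "optimal score threshold based on precision-recall" concrete, here is a small standard-library sketch that scans candidate cutoffs and keeps the one with the highest F1 score. Choosing F1 as the criterion is an assumption; the description above only states that precision and recall are used. The function name and the toy data are hypothetical.

```python
def best_threshold(scores, labels):
    """Return (cutoff, f1) maximising F1, treating score >= cutoff as a hit.

    `labels` holds 1 for foreground regions and 0 for background regions.
    """
    total_pos = sum(labels)
    best_f1, best_cut = -1.0, None
    for cut in sorted(set(scores)):
        called = sum(1 for s in scores if s >= cut)
        tp = sum(1 for s, l in zip(scores, labels) if s >= cut and l == 1)
        if tp == 0:
            continue  # precision and recall are both zero, F1 undefined
        precision = tp / called
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_cut = f1, cut
    return best_cut, best_f1

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
cutoff, f1 = best_threshold(scores, labels)
```

With these toy values the cutoff 0.4 recovers all three foreground regions at the cost of one background region, which gives the best balance between precision and recall.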

pycistarget.motif_enrichment_dem.p_adjust_bh(p: float)[source]

Benjamini-Hochberg p-value correction for multiple hypothesis testing.
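
For readers unfamiliar with the procedure, a minimal sketch of the Benjamini-Hochberg adjustment follows. It operates on a plain list of p-values and is meant as an illustration of the method, not as a copy of this function; the name benjamini_hochberg is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the rank decreases (monotonicity step).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each p-value is scaled by n / rank, and the running minimum keeps the adjusted values consistent with the ordering of the raw p-values.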

pycistarget.motif_enrichment_dem.shuffle_sequence(sequence: str)[source]

Shuffle given sequence
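
A shuffled sequence is typically used as a background with the same base composition as the original. The sketch below permutes individual characters, which preserves mononucleotide counts but not dinucleotide frequencies; whether this function works the same way is an assumption, and the seed argument is added here only to make the example reproducible.

```python
import random

def shuffle_sequence_sketch(sequence, seed=None):
    """Return a random permutation of the characters in `sequence`."""
    rng = random.Random(seed)
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)

original = "ACGTACGTTTGA"
shuffled = shuffle_sequence_sketch(original, seed=0)
```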

Homer

class pycistarget.motif_enrichment_homer.Homer(homer_path: str, bed_path: str, name: str, outdir: str, genome: str, size: str = 'given', mask: bool = True, denovo: bool = False, length: str = '8,10,12', meme_path: str | None = None, meme_collection_path: str | None = None, path_to_motif_annotations: str | None = None, annotation_version: str = 'v9', cistrome_annotation: List[str] = ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot'], motif_similarity_fdr: float = 0.001, orthologous_identity_threshold: float = 0.0)[source]

Homer class. Runs Homer for motif enrichment analysis on sets of regions.

homer_path

Path to Homer bin folder.

Type:

str

bed_path

Path to bed file containing region set to be analyzed with Homer.

Type:

str

name

Analysis name.

Type:

str

outdir

Path to folder to output Homer results.

Type:

str

genome

Homer genome label to use.

Type:

str

size

Fragment size to use for motif finding. Default: ‘given’ [uses the exact regions you give it]

Type:

str, optional

mask

Whether to mask repeats or not. Default: True

Type:

bool, optional

denovo

Whether to infer overrepresented motifs de novo. Default: False

Type:

bool, optional

length

Motif length values. Default: 8,10,12

Type:

str, optional

meme_path

Path to meme bin folder. Meme will be used if given for motif annotation. Default: None

Type:

str, optional

meme_collection_path

Path to motif collection (in .cb format) to compare homer motifs with. Default: None

Type:

str, optional

path_to_motif_annotations

Path to motif annotations. If not provided, they will be downloaded from https://resources.aertslab.org based on the specie name provided (only possible for mus_musculus, homo_sapiens and drosophila_melanogaster). Default: None

Type:

str, optional

annotation_version

Motif collection version. Default: v9

Type:

str, optional

cistrome_annotation

Annotation to use for forming cistromes. It can be ‘Direct_annot’ (direct evidence that the motif is linked to that TF), ‘Motif_similarity_annot’ (based on tomtom motif similarity), ‘Orthology_annot’ (based on orthology with a TF that is directly linked to that motif) or ‘Motif_similarity_and_Orthology_annot’. Default: [‘Direct_annot’, ‘Motif_similarity_annot’, ‘Orthology_annot’, ‘Motif_similarity_and_Orthology_annot’]

Type:

List, optional

motif_similarity_fdr

Minimal motif similarity value to consider two motifs similar. Default: 0.001

Type:

float, optional

orthologous_identity_threshold

Minimal orthology value for considering two TFs orthologous. Default: 0.0

Type:

float, optional

known_motifs

A dataframe containing known motif enrichment results.

Type:

pd.DataFrame

denovo_motifs

A dataframe containing de novo motif enrichment results.

Type:

pd.DataFrame

known_motif_hits

A dictionary containing regions with motif hits for each known motif.

Type:

Dict

denovo_motif_hits

A dictionary containing regions with motif hits for each de novo motif.

Type:

Dict

known_cistromes

A dictionary containing regions with motif hits for each TF found with known motifs.

Type:

Dict

denovo_cistromes

A dictionary containing regions with motif hits for each TF found de novo.

Type:

Dict

References

Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432

Methods

add_motif_annotation_homer()

Add motif annotations (based on Homer, cisTarget and meme if specified)

find_motif_hits([n_cpu])

Find motif hits with homer2 find

get_cistromes([annotation])

Format cistromes per TF

load_denovo()

Load de novo motif enrichment results from file.

load_known()

Load known motif enrichment results from file.

run()

Run Homer

add_motif_annotation_homer()[source]

Add motif annotations (based on Homer, cisTarget and meme if specified)

find_motif_hits(n_cpu=1)[source]

Find motif hits with homer2 find

Parameters:

n_cpu (int) – Number of cores to use.

get_cistromes(annotation: List[str] = ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot'])[source]

Format cistromes per TF

Parameters:

annotation (List, optional) – Annotation to use for forming cistromes. It can be ‘Direct_annot’ (direct evidence that the motif is linked to that TF), ‘Motif_similarity_annot’ (based on tomtom motif similarity), ‘Orthology_annot’ (based on orthology with a TF that is directly linked to that motif) or ‘Motif_similarity_and_Orthology_annot’. Default: [‘Direct_annot’, ‘Motif_similarity_annot’, ‘Orthology_annot’, ‘Motif_similarity_and_Orthology_annot’]

load_denovo()[source]

Load de novo motif enrichment results from file.

load_known()[source]

Load known motif enrichment results from file.

run()[source]

Run Homer

pycistarget.motif_enrichment_homer.homer_results(homer_dict, name, results='known')[source]

A function to show Homer results in jupyter notebooks.

Parameters:
  • homer_dict (Dict) – A dictionary with one Homer object per slot.

  • name (str) – Dictionary key of the analysis result to show. Default: None (All)

  • results (str) – Whether to show known or de novo results. Default: ‘known’

pycistarget.motif_enrichment_homer.run_homer(homer_path: str, region_sets: Dict[str, PyRanges], outdir: str, genome: str, size: str = 'given', mask: bool = True, denovo: bool = False, length: str = '8,10,12', n_cpu: int = 1, meme_path: str | None = None, meme_collection_path: str | None = None, path_to_motif_annotations: str | None = None, annotation_version: str = 'v9', cistrome_annotation: List[str] = ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot'], motif_similarity_fdr: float = 0.001, orthologous_identity_threshold: float = 0.0, **kwargs)[source]

Run Homer

Parameters:
  • homer_path (str) – Path to Homer bin folder.

  • region_sets (Dict) – A dictionary of PyRanges containing region coordinates for the region sets to be analyzed.

  • outdir (str) – Path to folder to output Homer results.

  • genome (str) – Homer genome label to use.

  • size (str, optional) – Fragment size to use for motif finding. Default: ‘given’ [uses the exact regions you give it]

  • mask (bool, optional) – Whether to mask repeats or not. Default: True

  • denovo (bool, optional) – Whether to infer overrepresented motifs de novo. Default: False

  • length (str, optional) – Motif length values. Default: 8,10,12

  • n_cpu (int) – Number of cores to use.

  • meme_path (str, optional) – Path to meme bin folder. Meme will be used if given for motif annotation. Default: None

  • meme_collection_path (str, optional) – Path to motif collection (in .cb format) to compare homer motifs with. Default: None

  • path_to_motif_annotations (str, optional) – Path to motif annotations. If not provided, they will be downloaded from https://resources.aertslab.org based on the specie name provided (only possible for mus_musculus, homo_sapiens and drosophila_melanogaster). Default: None

  • annotation_version (str, optional) – Motif collection version. Default: v9

  • cistrome_annotation (List, optional) – Annotation to use for forming cistromes. It can be ‘Direct_annot’ (direct evidence that the motif is linked to that TF), ‘Motif_similarity_annot’ (based on tomtom motif similarity), ‘Orthology_annot’ (based on orthology with a TF that is directly linked to that motif) or ‘Motif_similarity_and_Orthology_annot’. Default: [‘Direct_annot’, ‘Motif_similarity_annot’, ‘Orthology_annot’, ‘Motif_similarity_and_Orthology_annot’]

  • motif_similarity_fdr (float, optional) – Minimal motif similarity value to consider two motifs similar. Default: 0.001

  • orthologous_identity_threshold (float, optional) – Minimal orthology value for considering two TFs orthologous. Default: 0.0

  • **kwargs – Extra parameters to pass to ray.init().

References

Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432

Cluster-Buster

pycistarget.cluster_buster.cluster_buster(cbust_path: str, path_to_motifs: str, region_sets: Dict[str, PyRanges] | Dict[str, List] | None = None, path_to_genome_fasta: str | None = None, path_to_regions_fasta: str | None = None, n_cpu: int | None = 1, motifs: List[str] | None = None, verbose: bool | None = False, **kwargs)[source]

Run Cluster-Buster to score motifs on sets of regions.

Parameters:
  • cbust_path (str) – Path to cluster buster bin.

  • path_to_motifs (str, optional.) – Path to motif collection folder (in .cb format). Only required if using a shuffled background.

  • region_sets (Dict) – A dictionary of PyRanges containing region coordinates for the regions to be analyzed. Only required if path_to_regions_fasta is not provided.

  • path_to_genome_fasta (str, optional.) – Path to genome fasta file. Only required if path_to_regions_fasta is not provided. Default: None

  • path_to_regions_fasta (str, optional.) – Path to regions fasta file. Only required if path_to_genome_fasta is not provided. Default: None

  • n_cpu (int, optional) – Number of cores to use

  • motifs (List, optional) – Names of the motif files to use (from path_to_motifs). Default: None (All)

  • verbose (bool, optional) – Whether to print progress to screen

  • **kwargs – Additional parameters to pass to ray.init()

References

Frith, Martin C., Michael C. Li, and Zhiping Weng. “Cluster-Buster: Finding dense clusters of motifs in DNA sequences.” Nucleic acids research 31, no. 13 (2003): 3666-3668.

pycistarget.cluster_buster.get_sequence_names_from_fasta(fasta_filename: str)[source]

Retrieve sequence names from fasta
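
As an illustration of what retrieving sequence names from a FASTA file involves, the sketch below reads header lines (those starting with ">") and keeps the identifier up to the first whitespace. Trimming at the first whitespace is an assumption, and the function name and temporary file are hypothetical.

```python
import os
import tempfile

def sequence_names_from_fasta(fasta_filename):
    """Collect the identifier of every FASTA header line, in file order."""
    names = []
    with open(fasta_filename) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    # Keep only the identifier, dropping any description.
                    names.append(header.split()[0])
    return names

# Write a tiny FASTA file to demonstrate the function.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">chr1:100-200\nACGT\n>chr2:5-50 extra description\nGGCC\n")
    fasta_path = tmp.name

names = sequence_names_from_fasta(fasta_path)
os.remove(fasta_path)
```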

pycistarget.cluster_buster.grep(l: List, s: str)[source]

Helper for grep

pycistarget.cluster_buster.pyranges2names(regions: PyRanges)[source]

Convert pyranges to sequence name (fasta format)

Utils

pycistarget.utils.coord_to_region_names(coord: PyRanges)[source]

Convert coordinates to region names (UCSC format)
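
The UCSC format referred to here is the familiar chrom:start-end string. A minimal sketch of the conversion, using plain (chromosome, start, end) tuples instead of a PyRanges object so that it runs without extra dependencies (the function name is hypothetical):

```python
def to_region_names(coords):
    """Format (chrom, start, end) tuples as 'chrom:start-end' strings."""
    return [f"{chrom}:{start}-{end}" for chrom, start, end in coords]

region_names = to_region_names([("chr1", 100, 200), ("chrX", 5, 50)])
```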

pycistarget.utils.get_TF_list(motif_enrichment_table: DataFrame, annotation: List[str] = ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot'])[source]

Get TFs from motif enrichment tables

pycistarget.utils.get_cistrome_per_TF(motif_hits_dict, motifs)[source]

Format cistromes per TF
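
One plausible reading of "cistromes per TF" is the union of the regions hit by every motif annotated to that TF. The sketch below implements that reading with plain dictionaries and lists; treating the cistrome as a union, and the names used, are assumptions rather than a description of this function's internals.

```python
def cistrome_for_tf(motif_hits, motifs):
    """Union of regions hit by any of the given motifs, sorted for stability.

    `motif_hits` maps motif name -> list of region names.
    Motifs without recorded hits are skipped.
    """
    regions = set()
    for motif in motifs:
        regions.update(motif_hits.get(motif, []))
    return sorted(regions)

motif_hits = {
    "motif_1": ["chr1:100-200", "chr2:5-50"],
    "motif_2": ["chr2:5-50", "chr3:10-90"],
    "motif_3": ["chr9:1-2"],
}
cistrome = cistrome_for_tf(motif_hits, ["motif_1", "motif_2", "motif_missing"])
```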

pycistarget.utils.get_cistromes_per_region_set(motif_enrichment_region_set, motif_hits_regions_set, annotation: List[str] = ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot'])[source]

Get (direct/extended) cistromes for TFs

pycistarget.utils.get_motifs_per_TF(motif_enrichment_table: DataFrame, tf: str, motif_column: str, annotation: List[str] = ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot'])[source]

Get motif annotated to each TF from a motif enrichment table

pycistarget.utils.get_position_index(query_list, target_list)[source]

Get position of a query within a list
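
A position lookup of this kind can be sketched with a dictionary built once from the target list, which avoids rescanning the list for every query. Returning the first occurrence and raising KeyError for a missing query are assumptions of this sketch, and the function name is hypothetical.

```python
def position_index(query_list, target_list):
    """Return, for each query, the index of its first occurrence in target_list."""
    positions = {}
    for i, item in enumerate(target_list):
        positions.setdefault(item, i)  # keep the first index for duplicates
    return [positions[q] for q in query_list]

indices = position_index(["c", "a"], ["a", "b", "c", "a"])
```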

pycistarget.utils.homer2meme(homer_motif_path: str)[source]

Convert Homer motifs to meme format

pycistarget.utils.inplace_change(filename, old_string, new_string)[source]

Replace string in a file

pycistarget.utils.load_motif_annotations(specie: str, version: str = 'v9', fname: str | None = None, column_names=('#motif_id', 'gene_name', 'motif_similarity_qvalue', 'orthologous_identity', 'description'), motif_similarity_fdr: float = 0.001, orthologous_identity_threshold: float = 0.0)[source]

Load motif annotations from a motif2TF snapshot.

Parameters:
  • specie – Specie to retrieve annotations for.

  • version – Motif collection version.

  • fname – The snapshot taken from motif2TF.

  • column_names – The names of the columns in the snapshot to load.

  • motif_similarity_fdr – The maximum False Discovery Rate to find factor annotations for enriched motifs.

  • orthologous_identity_threshold – The minimum orthologous identity to find factor annotations for enriched motifs.

pycistarget.utils.region_names_to_coordinates(region_names: List)[source]

Convert region names (UCSC format) to coordinates (pd.DataFrame)
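
Parsing a chrom:start-end string back into its parts can be sketched as follows. The example returns tuples rather than a pd.DataFrame to stay dependency-free, and it splits on the last colon so that sequence names containing a colon would still parse; both choices, and the function name, are assumptions.

```python
def parse_region_name(name):
    """Split 'chrom:start-end' into (chrom, start, end) with integer positions."""
    chrom, _, span = name.rpartition(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)

coordinates = [parse_region_name(n) for n in ["chr1:100-200", "chrX:5-50"]]
```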

pycistarget.utils.region_sets_to_signature(region_set: list, region_set_name: str)[source]

Generates a gene signature object from a dict of PyRanges objects

Parameters:
  • region_set – PyRanges object to be converted into a gene signature object

  • region_set_name – Name of the regions set

pycistarget.utils.target_to_query(target: PyRanges | List[str], query: PyRanges | List[str], fraction_overlap: float = 0.4)[source]

Map query regions to another set of regions
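
The role of fraction_overlap can be illustrated with a small interval-overlap sketch. It uses plain (chromosome, start, end) tuples and measures the overlap as a fraction of the query length; which region the fraction is measured against in this function is not stated above, so that choice is an assumption, as are the function names.

```python
def overlap_fraction(query, target):
    """Fraction of the query interval covered by the target interval."""
    q_chrom, q_start, q_end = query
    t_chrom, t_start, t_end = target
    if q_chrom != t_chrom:
        return 0.0
    overlap = min(q_end, t_end) - max(q_start, t_start)
    return max(overlap, 0) / (q_end - q_start)

def map_query_to_target(queries, targets, fraction_overlap=0.4):
    """Return (query, target) pairs whose overlap meets the threshold."""
    pairs = []
    for q in queries:
        for t in targets:
            if overlap_fraction(q, t) >= fraction_overlap:
                pairs.append((q, t))
    return pairs

queries = [("chr1", 100, 200), ("chr1", 500, 600)]
targets = [("chr1", 150, 400), ("chr1", 590, 900)]
pairs = map_query_to_target(queries, targets)
```

The first query overlaps its target by half of its length and is mapped; the second overlaps by only a tenth and is dropped at the default threshold of 0.4. The nested loop is quadratic and only suitable for small inputs.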

pycistarget.utils.tomtom(homer_motif_path: str, meme_path: str, meme_collection_path: str)[source]

Run tomtom for Homer motif annotation