Using pycisTarget within the SCENIC+ workflow

[1]:
%matplotlib inline
import pycistarget
pycistarget.__version__
[1]:
'1.0.1.dev29+g95fd9fb'

As part of the SCENIC+ workflow we provide a wrapper to run pycistarget with the recommended settings. This approach will run cistarget and DEM, in the original region sets (binarized topics, DARs, …) and without promoters when indicated (to improve the signal of non-promoter motifs if region sets include a large proportion of promoters). For DEM, we use as background regions in other regions sets in the dictionary. This step can be run using a dataset specific database or a precomputed database.

1. Using a dataset specific database

If the consensus regions in the dataset do not overlap with regions in the precomputed database (SCREEN regions for mouse and human, cisTarget regions for Drosophila), we recommend to generate a dataset specific database using the consensus peaks as regions. When using a precomputed database, dataset regions that do not overlap with the precomputed regions will be removed from the analysis, as they will not be found in cistromes.

As a first step, we need to generate the dataset specific database using the code available at https://github.com/aertslab/create_cisTarget_databases. Below you can find the basic steps to do so:

[ ]:
%%bash
# Paths and parameters
consensdir='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/data/MACS_ATAC/iterative/peak_filtering_norm'
outdir='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/ctx_db'
tag='cluster_V10_DPCL'
genomefa='/staging/leuven/stg_00002/lcb/resources/human/hg38/hg38.fa'
ncpu=36
cbdir='/staging/leuven/stg_00002/lcb/cbravo/cluster_motif_collection_V10_no_desso_no_factorbook/cluster_buster/'
# Create outdir
if [ ! -d $outdir ]
then
    mkdir -p $outdir
fi
#### List of motifs
motif_list='/staging/leuven/stg_00002/lcb/cbravo/cluster_motif_collection_V10_no_desso_no_factorbook/annotated_motifs/motifs-v10-nr.more_orthology.hgnc-mm0.00001-o0.0_clust.tsv'
#### Get fasta sequences
echo "Extracting FASTA ..."
module load BEDTools
bedtools getfasta -fi $genomefa -bed $consensdir/combined_summits_final.bed > $consensdir/consensus_regions.fa
echo "Done."
#### Create scores DB
echo "Creating scores DB files ..."
#### Activate environment
conda_initialize /staging/leuven/stg_00002/lcb/ghuls/software/miniconda3/
conda activate create_cistarget_databases
#### Set ${create_cistarget_databases_dir} to https://github.com/aertslab/create_cisTarget_databases
create_cistarget_databases_dir='/staging/leuven/stg_00002/lcb/ghuls/software/create_cisTarget_databases'
#### Score the motifs in 1 chunks; we will use the non-redundant db here
for current_part in {1..1} ; do
     ${create_cistarget_databases_dir}/create_cistarget_motif_databases.py \
         -f $consensdir/consensus_regions.fa \
         -M $cbdir \
         -m $motif_list \
         -p ${current_part} 1 \
         -o $outdir/$tag \
         -t 35 \
         -l
done
echo "Done."
#### Create rankings
echo "Creating rankings DB files ..."
${create_cistarget_databases_dir}/convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py -i $outdir/$tag.motifs_vs_regions.scores.feather -s 555
echo "Done."
echo "ALL DONE."

Next you will need to load the region sets you want to include in the analysis. By default, we use binarized topics and DARs, but you can include additional analyses:

[ ]:
# Load region binarized topics
import pickle
outDir = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/pycisTopic/'
infile = open(outDir+'topic_binarization/binarized_topic_region.pkl', 'rb')
binarized_topic_region = pickle.load(infile)
infile.close()
# Load DARs
import pickle
infile = open(outDir+'DARs/DARs.pkl', 'rb')
DARs_dict = pickle.load(infile)
infile.close()
# Format region sets
import re
import pyranges as pr
from pycistarget.utils import *
region_sets = {}
region_sets['Topics'] = {key: pr.PyRanges(region_names_to_coordinates(binarized_topic_region[key].index.tolist())) for key in binarized_topic_region.keys()}
region_sets['DARs'] = {re.sub('[^A-Za-z0-9]+', '_', key): pr.PyRanges(region_names_to_coordinates(DARs_dict[key].index.tolist())) for key in DARs_dict.keys()}

Finally we run the wrapper function. This will create a menr.pkl file in the output directory that can be directly used as input for SCENIC+.

[ ]:
# Run pycistarget
# run_without_promoters = True, will run the methods in all regions + the region sets without promoters
import os
os.chdir('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/scenicplus/src/')
from scenicplus.wrappers.run_pycistarget import *
run_pycistarget(region_sets,
                 ctx_db_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/ctx_db/cluster_V10_DPCL.regions_vs_motifs.rankings.feather',
                 species = 'homo_sapiens',
                 save_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/pycistarget/cluster_V10_V2/',
                 dem_db_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/ctx_db/cluster_V10_DPCL.regions_vs_motifs.scores.feather',
                 run_without_promoters = True,
                 biomart_host = 'http://www.ensembl.org',
                 promoter_space = 500,
                 ctx_auc_threshold = 0.005,
                 ctx_nes_threshold = 3.0,
                 ctx_rank_threshold = 0.05,
                 dem_log2fc_thr = 0.5,
                 dem_motif_hit_thr = 3.0,
                 dem_max_bg_regions = 500,
                 path_to_motif_annotations = '/staging/leuven/stg_00002/lcb/cbravo/cluster_motif_collection_V10_no_desso_no_factorbook/snapshots/motifs-v10-nr.more_orthology.hgnc-mm0.00001-o0.0_clust.tsv',
                 annotation_version = 'v10nr_clust',
                 annotation = ['Direct_annot', 'Orthology_annot'],
                 n_cpu = 1,
                 _temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')

2. Using a precomputed database

Alternatively you can use a precomputed database. This will reduce the running time as you can skip the generation of a database; however, regions not overlapping regions in the database will be missed. If the overlap is significant (>80%), using this database will give similar results. Precomputed databases are available at https://resources.aertslab.org/cistarget/.

We first load the region sets as before:

[ ]:
# Load region binarized topics
import pickle
outDir = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/pycisTopic/'
infile = open(outDir+'topic_binarization/binarized_topic_region.pkl', 'rb')
binarized_topic_region = pickle.load(infile)
infile.close()
# Load DARs
import pickle
infile = open(outDir+'DARs/DARs.pkl', 'rb')
DARs_dict = pickle.load(infile)
infile.close()
# Format region sets
import re
import pyranges as pr
from pycistarget.utils import *
region_sets = {}
region_sets['Topics'] = {key: pr.PyRanges(region_names_to_coordinates(binarized_topic_region[key].index.tolist())) for key in binarized_topic_region.keys()}
region_sets['DARs'] = {re.sub('[^A-Za-z0-9]+', '_', key): pr.PyRanges(region_names_to_coordinates(DARs_dict[key].index.tolist())) for key in DARs_dict.keys()}

Finally, we run the pycistarget wrapper, but using the precomputed database instead.

[ ]:
# Run pycistarget
# run_without_promoters = True, will run the methods in all regions + the region sets without promoters
import os
os.chdir('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/scenicplus/src/')
from scenicplus.wrappers.run_pycistarget import *
run_pycistarget(region_sets,
                 ctx_db_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/hg38_SCREEN_cluster_db/cluster_SCREEN.regions_vs_motifs.rankings.feather',
                 species = 'homo_sapiens',
                 save_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/pycistarget/SCREEN_cluster_V10_V2/',
                 dem_db_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/hg38_SCREEN_cluster_db/cluster_SCREEN.regions_vs_motifs.scores.feather',
                 run_without_promoters = True,
                 biomart_host = 'http://www.ensembl.org',
                 promoter_space = 500,
                 ctx_auc_threshold = 0.005,
                 ctx_nes_threshold = 3.0,
                 ctx_rank_threshold = 0.05,
                 dem_log2fc_thr = 0.5,
                 dem_motif_hit_thr = 3.0,
                 dem_max_bg_regions = 500,
                 path_to_motif_annotations = '/staging/leuven/stg_00002/lcb/cbravo/cluster_motif_collection_V10_no_desso_no_factorbook/snapshots/motifs-v10-nr.more_orthology.hgnc-mm0.00001-o0.0_clust.tsv',
                 annotation_version = 'v10nr_clust',
                 annotation = ['Direct_annot', 'Orthology_annot'],
                 n_cpu = 1,
                 _temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')