updated 28.09.15


ActiveDriver is a computational method for identifying 'active' sites in proteins (signalling sites, protein domains, regulatory motifs) that are specifically and significantly mutated in cancer genomes. ActiveDriver provides signalling-related interpretation of single nucleotide variants (SNVs) identified in cancer genome sequencing. We carried out a comprehensive analysis of somatic variants of phosphorylation sites and kinase domains in 800 cancer genomes.

ActiveDriver is based on a gene-centric logistic regression model that considers multiple factors when estimating the significance of mutation enrichment (or depletion) in active sites. The factors include mutation frequency, the distribution of active sites in the protein sequence, the position of mutations relative to active sites (direct and flanking), and structured and disordered regions of proteins.
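To make the idea of a gene-centric regression test more concrete, here is a minimal sketch in R. It is not the package's internal code: it simulates one toy protein, fits a null logistic model (mutation status explained by disorder only) and an alternative model (disorder plus active-site membership), and compares them with a likelihood-ratio test. The protein length, region coordinates, mutation rates, and all variable names are assumptions made for illustration.

```r
# Simplified illustration only; all numbers below are simulated assumptions
set.seed(1)
n <- 200                                           # toy protein length
active <- as.integer(seq_len(n) %in% 91:110)       # hypothetical active region
disordered <- as.integer(seq_len(n) > 120)         # hypothetical disordered tail

# simulate a per-residue mutation indicator, enriched in the active region
p_mut <- ifelse(active == 1, 0.4, 0.05)
toy <- data.frame(mutated = rbinom(n, 1, p_mut), active, disordered)

# null model: mutations depend on disorder only
h0 <- glm(mutated ~ disordered, family = binomial, data = toy)
# alternative model: active-site membership added as a covariate
h1 <- glm(mutated ~ disordered + active, family = binomial, data = toy)

# likelihood-ratio test on the drop in deviance (one extra parameter)
lr_stat <- deviance(h0) - deviance(h1)
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
print(p_value)
```

A small p-value here indicates that knowing whether a residue lies in the active region improves the fit, which is the intuition behind testing for site-specific mutation enrichment.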


Please refer to the following publications:
  • Jüri Reimand, Gary D. Bader: Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. (2013) Molecular Systems Biology, 9:637. doi:10.1038/msb.2012.68 [PDF].
    See also a Research Highlight in Genome Medicine.
    Supplementary data (tables S1-S9) [ZIP].
  • Jüri Reimand, Omar Wagih, Gary D. Bader: The mutational landscape of phosphorylation signaling in cancer. (2013) Nature Scientific Reports, 3:2651. doi:10.1038/srep02651 [PDF].
    Supplementary data (tables S1-S8) [ZIP].
    Supplementary data (Synapse at syn2237931).
    Original pan-cancer 12 mutations from TCGA (Synapse at syn1729383).


The ActiveDriver R package is available here (open source, GNU GPL):
ActiveDriver_0.0.10.tar.gz [updated 06.04.2015; change log]

Example code [updated 06.04.15]

The following code shows an example ActiveDriver analysis of seven genes with mutations in the TCGA pan-cancer project. The required input files can be found here. Unzip the archive into a folder named "pancan12_example" in the working directory of R.
# load library
library(ActiveDriver)

# load required datasets
muts = read.delim("pancan12_example/mutations.txt")
sites = read.delim("pancan12_example/phosphosites.txt")
seqs = read_fasta("pancan12_example/sequences.fa")
disorder = read_fasta("pancan12_example/sequence_disorder.fa")

# run ActiveDriver
psnv_info = ActiveDriver(seqs, disorder, muts, sites)

# save gene-based p-values and merged report as CSV files
write.csv(psnv_info$all_gene_based_fdr, "pancan12_results_pvals.csv")
write.csv(psnv_info$merged_report, "pancan12_results_merged.csv")

# look at first few lines of every table in results
lapply(psnv_info, head)
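
As a rough illustration of what FDR multiple testing correction does to gene-level p-values, the sketch below applies base R's p.adjust to a made-up table and keeps genes below a 0.05 cutoff. The data frame, its column names (gene, p, fdr), the gene labels, and the cutoff are assumptions for illustration; they are not taken from the ActiveDriver output format.

```r
# Toy stand-in for a gene-level p-value table; genes and p-values are made up
gene_pvals <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  p = c(0.0004, 0.012, 0.04, 0.4, 0.85),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg FDR correction across all tested genes
gene_pvals$fdr <- p.adjust(gene_pvals$p, method = "fdr")

# keep genes that pass a hypothetical FDR cutoff of 0.05
significant <- gene_pvals[gene_pvals$fdr < 0.05, ]
print(significant)
```

Corrected values are never smaller than the raw p-values, so a gene that looks significant before correction may drop out afterwards.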

Results explained

The above example produces an R list with six tables:
  • all_active_mutations - table of active mutations in sites or regions. The field active region identifies the mutated protein region (see below) and status defines mutation type (DI-direct, N1-close flanking, N2-distant flanking).
  • all_active_sites - table of all active sites in proteins, identified by the field active region. Position indicates first 'active' residue in protein sequence.
  • all_region_based_pval - table of site-based significance tests. Sites are identified by the field region. The fields med, low, high show expected mutation counts (+/- s.d.) and obs shows observed mutation counts.
  • all_gene_based_fdr - gene-based significance scores before and after FDR multiple testing correction.
  • all_active_regions - sequence coordinates of active regions in proteins. The field 'reg' corresponds to region ID as shown in table all_region_based_pval.
  • merged_report - table with each mutated active site region, sites in the region, and corresponding mutations.
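
The status codes in all_active_mutations (DI, N1, N2) describe how far a mutation lies from an active site. The sketch below shows one way such a classification could work. The distance thresholds (up to 2 residues for close flanking, up to 7 for distant flanking) and the function name classify_mutation are assumptions for illustration and may not match the package's actual settings.

```r
# Classify a mutation by its distance to the nearest active site.
# close_max and distant_max are hypothetical thresholds, not package defaults.
classify_mutation <- function(mut_pos, site_pos, close_max = 2, distant_max = 7) {
  d <- min(abs(mut_pos - site_pos))     # distance to the nearest site
  if (d == 0) return("DI")              # mutation hits the site residue itself
  if (d <= close_max) return("N1")      # close flanking
  if (d <= distant_max) return("N2")    # distant flanking
  NA_character_                         # outside any active region
}

# toy protein with active sites at residues 15 and 40
sites <- c(15, 40)
mutation_positions <- c(15, 17, 45, 70)
statuses <- vapply(mutation_positions, classify_mutation, character(1),
                   site_pos = sites)
print(data.frame(position = mutation_positions, status = statuses))
```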

Example data [Reimand et al, Mol Sys Biol 2013]

ActiveDriver requires four types of input: protein sequences, predicted sequence disorder, mutations, and active sites (phosphosites in the example above). Example data originate from our first phosphosite paper.
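
The sketch below builds the four inputs by hand for a single toy protein and runs a few consistency checks that are worth doing before an analysis. The gene name, sequence, and especially the column names (gene, sample_id, position, wt_residue, mut_residue, residue) are assumptions for illustration; consult the example files and the package documentation for the exact format ActiveDriver expects.

```r
# Hypothetical minimal inputs for one toy protein; names and columns are assumed
seqs <- list(TOYGENE = "MSTPSRKLAYSPEQ")
disorder <- list(TOYGENE = "11110000001111")   # 1 = disordered, 0 = structured

muts <- data.frame(
  gene = c("TOYGENE", "TOYGENE"),
  sample_id = c("sample1", "sample2"),
  position = c(5, 6),
  wt_residue = c("S", "R"),
  mut_residue = c("L", "H"),
  stringsAsFactors = FALSE
)

sites <- data.frame(
  gene = c("TOYGENE", "TOYGENE"),
  position = c(5, 10),
  residue = c("S", "Y"),
  stringsAsFactors = FALSE
)

# helper: residue found in the sequence at each annotated position
residue_at <- function(tab) {
  unname(substring(unlist(seqs[tab$gene]), tab$position, tab$position))
}

# check 1: disorder string has the same length as the sequence
lengths_ok <- nchar(seqs$TOYGENE) == nchar(disorder$TOYGENE)
# check 2: wild-type residues of mutations match the sequence
muts_ok <- all(residue_at(muts) == muts$wt_residue)
# check 3: annotated site residues match the sequence
sites_ok <- all(residue_at(sites) == sites$residue)
print(c(lengths_ok = lengths_ok, muts_ok = muts_ok, sites_ok = sites_ok))
```

Mismatches in any of these checks usually mean that mutations or sites were mapped to a different protein isoform than the one in the sequence file.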

Pancancer data [Reimand et al, Nat Sci Rep 2013]

The following files relate to our phosphosite analysis of TCGA pan-cancer mutations (published Oct. 2013).
  • 241,701 non-synonymous point mutations in 3,185 tumor samples from pancan12 [ZIP]. NB! These were re-mapped to RefSeq sequences using Annovar, see below. The original mutation files can be found in Synapse at syn1729383.
  • 87,898 phosphosites in protein sequences [ZIP].
  • Protein sequences: longest isoforms for 18,671 human genes [ZIP].
  • Predicted disorder of protein sequences (from Disopred2) [ZIP].
  • Map of gene symbols and corresponding protein isoforms (RefSeq IDs) [ZIP].

Phosphosites and mutations mapped to human proteins in Ensembl 70 [unpublished; 28.09.15]

A zipped archive with protein sequences, sequence disorder, and phosphorylation sites for human proteins in Ensembl 70 can be downloaded from here.

ActiveDriver input files for HG38 [unpublished; 06.04.2015]

A zipped archive with protein sequences, sequence disorder, four types of PTM sites (phosphorylation, ubiquitination, acetylation, methylation), and pancan12 mutations converted to HG38 using LiftOver can be downloaded from here.

Contact us

Contact us at Juri.Reimand[at] and Gary.Bader[at]

Website of Jüri Reimand
Website of Bader lab
The Donnelly Centre for Cellular+Biomolecular Research
University of Toronto