PCAWGScout

Help

This workflow offers several tools for exploring the consequences of protein mutations. It reports features that overlap the mutations or that lie in close physical proximity to them.

The features reported include protein domains, variants, helices, ligand binding residues, catalytic sites, transmembrane domains, InterPro domains, and known somatic mutations in different types of cancer. This information is extracted from resources such as UniProt, COSMIC, InterPro and Appris. It can also identify mutations affecting the interfaces of protein complexes.

This workflow uses PDB files to identify residues in close physical proximity. This information is used to find features close to the mutations (within 5 angstroms) and mutations in residues close to residues in a complex partner (within 8 angstroms).
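
As an illustration of the proximity criterion, the following is a minimal Python sketch, not the workflow's own code. It assumes two residues count as neighbours when any pair of their atoms lies within the threshold; the text above does not say which atoms are compared. The function name `neighbour_pairs`, the residue identifiers, and the toy coordinates are all hypothetical.

```python
import math
from itertools import combinations

def neighbour_pairs(residues, distance=5.0):
    """Return pairs of residue ids whose closest atoms lie within `distance` angstroms.

    `residues` maps a residue id such as "A:12" to a list of (x, y, z) atom
    coordinates. Closest-atom distance is an assumption made for this sketch.
    """
    pairs = []
    for (id_a, atoms_a), (id_b, atoms_b) in combinations(sorted(residues.items()), 2):
        if any(math.dist(p, q) <= distance for p in atoms_a for q in atoms_b):
            pairs.append((id_a, id_b))
    return pairs

# Toy coordinates (made up): one atom per residue, two chains.
toy = {
    "A:1": [(0.0, 0.0, 0.0)],
    "A:2": [(3.0, 0.0, 0.0)],
    "B:7": [(0.0, 7.0, 0.0)],
}

print(neighbour_pairs(toy, 5.0))  # [('A:1', 'A:2')]
print(neighbour_pairs(toy, 8.0))  # [('A:1', 'A:2'), ('A:1', 'B:7'), ('A:2', 'B:7')]
```

The all-against-all comparison is quadratic in the number of residues; a spatial index would be the usual choice for full-size structures.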

PDBs are extracted from Interactome3d, which organizes thousands of PDBs, covering both experimental structures and structural models, for individual proteins and protein complexes.

Pairwise (Smith-Waterman) alignment is used to reconcile inconsistencies between the protein sequences in PDBs, UniProt and Ensembl (Ensembl Protein ID).
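
The idea of mapping positions between two versions of a sequence by local alignment can be sketched as follows. This is a minimal Python illustration with a linear gap penalty and made-up scoring values (`match`, `mismatch`, `gap`); the scoring scheme the workflow actually uses is not described here, and the function name `smith_waterman_map` is hypothetical.

```python
def smith_waterman_map(a, b, match=2, mismatch=-1, gap=-2):
    """Locally align `a` against `b` and return {position in a: position in b}.

    Positions are 1-based. Only aligned (non-gap) positions appear in the map.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)

    # Fill the scoring matrix, remembering the highest-scoring cell.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best cell until the local alignment ends (score 0).
    mapping = {}
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            mapping[i] = j
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # residue in `a` aligned to a gap
        else:
            j -= 1  # residue in `b` aligned to a gap
    return dict(sorted(mapping.items()))

# A PDB chain covering only a fragment of the full-length sequence (toy data).
pdb_chain = "KLVEA"
full_sequence = "MSTKLVEAQ"
print(smith_waterman_map(pdb_chain, full_sequence))
# {1: 4, 2: 5, 3: 6, 4: 7, 5: 8}
```

With such a map, a residue numbered in the PDB chain can be translated to its position in the reference isoform, and the reverse lookup follows by inverting the dictionary.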

Reference:

Vazquez M, Valencia A, Pons T. (2015) Structure-PPi: a module for the annotation of cancer-related single-nucleotide variants at protein-protein interfaces. Bioinformatics 31(14):2397-2399 (doi: 10.1093/bioinformatics/btv142)

Wizard

Use the following textbox to input your mutations and retrieve all annotations, including neighbours and interfaces. This method is limited to 1000 variants; use the other (more granular) tasks if your mutation set is larger. Mutations can be specified as a genomic mutation (e.g. 18:6237978:G), as a mutated isoform (e.g. ENSP00000382976:L257R), or with any other identifier in place of the Ensembl Protein ID, such as an Associated Gene Name or gene symbol (e.g. KRAS:G12V).

If genomic mutations are given, only principal isoforms are considered. If the protein is specified with any identifier other than Ensembl Protein ID, it is translated to Ensembl Gene ID and its principal isoform is then taken from Appris. For instance, if the mutation is given using a UniProt/SwissProt Accession, and the change is relative to the sequence reported in UniProt, inconsistencies may arise from incorrect isoform mappings or from discrepancies between the sequences. No attempt is made to fix such inconsistencies in this wizard.

The organism is assumed to be Hsa/feb2014. If genomic mutations are given, they are assumed to be relative to the Watson (forward) strand.
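
The input formats accepted by the wizard can be told apart with a few simple rules. The Python sketch below is illustrative only: the function names, the returned dictionary keys, and the restriction to single amino-acid substitutions are assumptions of this example, not a description of the wizard's actual parser.

```python
import re

# Single amino-acid substitution such as L257R or G12V (assumed pattern).
PROTEIN_CHANGE = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")

def classify_mutation(text):
    """Classify one input line as a genomic mutation or a protein mutation."""
    fields = text.strip().split(":")
    if len(fields) == 3 and fields[1].isdigit():
        chromosome, position, allele = fields
        return {"kind": "genomic", "chromosome": chromosome,
                "position": int(position), "allele": allele}
    if len(fields) == 2:
        change = PROTEIN_CHANGE.match(fields[1])
        if change:
            # Ensembl Protein IDs start with ENSP; anything else needs translation.
            kind = "mutated_isoform" if fields[0].startswith("ENSP") else "other_identifier"
            return {"kind": kind, "protein": fields[0],
                    "reference": change.group(1),
                    "position": int(change.group(2)),
                    "alternative": change.group(3)}
    raise ValueError(f"Unrecognised mutation format: {text!r}")

def check_wizard_limit(mutations, limit=1000):
    """Raise if the list exceeds the wizard's stated limit of 1000 variants."""
    if len(mutations) > limit:
        raise ValueError(f"{len(mutations)} variants given; the wizard accepts at most {limit}")
    return mutations

for example in ["18:6237978:G", "ENSP00000382976:L257R", "KRAS:G12V"]:
    print(example, "->", classify_mutation(example)["kind"])
```

Entries classified as `other_identifier` are the ones that would go through the gene-to-principal-isoform translation described above, which is where isoform mismatches can appear.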

Scores

While Structure-PPi itself is not intended to be a stand-alone damage predictor, we provide a score, the Structure-PPi feature score, that quantifies the protein features that overlap or are close to each mutation. The score is built by adding individual scores for the different features. The individual score that each feature contributes has been selected based on expert opinion and guided by empirical results on the COSMIC and 1000 Genomes data. The scoring scheme is as follows:

Appris features: we add 2 if at least one ligand binding or catalytic site annotated in firestar is affected; if none of the affected features meets this condition we add only 1
COSMIC mutations: 3 if more than ten COSMIC samples have mutations overlapping the residue, 2 if more than five, and 1 if more than one. We add nothing if just one sample is found
UniProt variants: 1 if the position has at least one variant annotated. If at least one of these variants is also annotated as Disease we add 2 more. If none is classified as Disease but at least one is annotated as Unclassified we add 1 more. If all are annotated as Polymorphism we add nothing more.
UniProt features: We add 1 if any of the following features are affected: MUTAGEN, DISULFID, DNA_BIND, METAL, INTRAMEM, CROSSLNK. These features are more than twice as frequent in COSMIC as in 1000 Genomes. MUTAGEN entries are only considered if the description field does not include the text 'No effect'
Affected interfaces: We add 2 if any protein-protein interaction surface is affected

These scores are calculated for the direct hits and for the neighbour hits (with the exception of affected interfaces, which do not apply to neighbours). Scores for neighbours are divided by 2. The final tally is reported under the section Damage predictions in the wizard report.
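
The scoring scheme above can be written out as a short Python sketch. The input layout (dictionary keys such as `firestar_site` or `cosmic_samples`) is hypothetical and chosen only for this example. The sketch also assumes that the final tally is the direct score plus the halved neighbour score, which the text implies but does not state outright.

```python
SCORED_UNIPROT_FEATURES = {"MUTAGEN", "DISULFID", "DNA_BIND", "METAL", "INTRAMEM", "CROSSLNK"}

def feature_score(hit, include_interfaces=True):
    """Score one set of affected features following the scheme described above."""
    score = 0

    # Appris: 2 for a firestar ligand-binding or catalytic site, else 1 for any feature.
    if hit.get("firestar_site"):
        score += 2
    elif hit.get("appris_features"):
        score += 1

    # COSMIC: tiers by number of samples with a mutation at the residue.
    samples = hit.get("cosmic_samples", 0)
    if samples > 10:
        score += 3
    elif samples > 5:
        score += 2
    elif samples > 1:
        score += 1

    # UniProt variants: 1 for any variant, plus 2 for Disease or 1 for Unclassified.
    variants = hit.get("uniprot_variants", [])
    if variants:
        score += 1
        if "Disease" in variants:
            score += 2
        elif "Unclassified" in variants:
            score += 1

    # UniProt features: 1 if any listed type is hit; MUTAGEN 'No effect' is ignored.
    for feature_type, description in hit.get("uniprot_features", []):
        if feature_type in SCORED_UNIPROT_FEATURES and not (
            feature_type == "MUTAGEN" and "No effect" in description
        ):
            score += 1
            break

    # Interfaces: 2 if any protein-protein interaction surface is affected.
    if include_interfaces and hit.get("interface"):
        score += 2
    return score

def total_score(direct, neighbours):
    """Direct score plus half the neighbour score (interfaces excluded for neighbours)."""
    return feature_score(direct) + feature_score(neighbours, include_interfaces=False) / 2

direct = {
    "firestar_site": True,
    "cosmic_samples": 12,
    "uniprot_variants": ["Disease", "Polymorphism"],
    "uniprot_features": [("MUTAGEN", "No effect"), ("METAL", "Zinc")],
    "interface": True,
}
neighbours = {"appris_features": True, "cosmic_samples": 3, "uniprot_variants": ["Polymorphism"]}

print(feature_score(direct))            # 11
print(total_score(direct, neighbours))  # 12.5
```

In the example, the direct hit collects 2 (firestar) + 3 (COSMIC) + 3 (Disease variant) + 1 (METAL) + 2 (interface) = 11, and the neighbour hit collects 3, which is halved to 1.5.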

Precomputed results

The following files contain reports for all mutations in the COSMIC and 1000 Genomes databases. They were produced using the Structure-PPi and Sequence workflows. Due to the large size of these datasets, we have skipped annotation with the `COSMIC` database itself, which would have resulted in massive result files.

Tasks

annotate: Annotates genomic mutations based on the protein features that are overlapping amino-acid changes
annotate_mi: Annotates mutated isoforms based on the protein features that are overlapping amino-acid changes
annotate_mi_neighbours: Annotates mutated isoforms based on the protein features that are in close physical proximity to amino-acid changes
annotate_neighbours: Annotates genomic mutations based on the protein features that are in close physical proximity to amino-acid changes
interfaces: Find variants that affect residues in protein-protein interaction surfaces
mi_interfaces: Find mutated isoforms with affected residues in protein-protein interaction surfaces
mi_neighbours: Find residues in close physical proximity to amino-acid changes in mutated isoforms
neighbour_map: For a given PDB, find all pairs of residues that fall within a given 'distance' of each other. It uses PDBs from Interactome3d for individual proteins.
neighbours_in_pdb: Use a PDB to find the residues neighbouring, in three-dimensional space, a particular residue in a given sequence.
pdb_alignment_map: Find the correspondence between sequence positions in a PDB and in a given sequence. PDB positions are reported as `chain:position`.
pdb_chain_position_in_sequence: Translate the positions of amino-acids in a particular chain of the provided PDB into positions inside a given sequence.
score_summary: Produce a small table summarizing the mutation scores and a few of the features
scores: Score a list of variants based on the report generated by the `wizard`. The limitation to 1000 variants still holds.
sequence_position_in_pdb: Translate the positions inside a given amino-acid sequence to positions in the sequence of a PDB by aligning them
wizard: Run a list of variants through all the analysis and produce a combined report. This analysis is limited to 1000 variants (use the other more granular methods otherwise). Variants can be expressed as genomic mutations or protein mutations. When protein mutations are used, the name of the protein can be `Ensembl Protein ID` or any other protein or gene identifier, including gene symbols (e.g. KRAS:G12V)