This workflow offers several tools to explore the consequences of protein mutations. It reports features that overlap the mutations or that lie in close physical proximity to them.
The features reported include protein domains, variants, helices, ligand binding residues, catalytic sites, transmembrane domains, InterPro domains, and known somatic mutations in different types of cancer. This information is extracted from resources such as UniProt, COSMIC, InterPro and Appris. The workflow can also identify mutations affecting the interfaces of protein complexes.
This workflow uses PDB files to determine which residues are in close physical proximity. This information is used to find features within 5 angstroms of a mutation, and mutations in residues that lie within 8 angstroms of a residue in a complex partner.
PDBs are extracted from Interactome3d, which organizes thousands of PDBs, covering both experimental structures and structure models, for individual proteins and protein complexes.
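The two distance cutoffs described above can be illustrated with a minimal sketch. It assumes that distance between residues is measured as the smallest distance between any pair of their atoms; the text does not say which atoms the workflow actually compares, so that choice, the function names, and the toy coordinates are all hypothetical.

```python
import math

NEIGHBOUR_CUTOFF = 5.0   # angstroms, features close to a mutation (from the text)
INTERFACE_CUTOFF = 8.0   # angstroms, residues close to a complex partner (from the text)


def min_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of one residue and any atom of another.

    Assumption: the workflow may use a different definition (e.g. C-alpha only).
    """
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def close_pairs(residues, cutoff, across_chains=False):
    """Return residue pairs whose minimum atom distance is <= cutoff.

    residues maps a (chain, position) key to a list of (x, y, z) coordinates.
    With across_chains=False only pairs within the same chain are reported;
    with across_chains=True only pairs from different chains are reported.
    """
    keys = sorted(residues)
    pairs = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            different_chains = first[0] != second[0]
            if different_chains != across_chains:
                continue
            if min_distance(residues[first], residues[second]) <= cutoff:
                pairs.append((first, second))
    return pairs


# Toy coordinates, invented for illustration only.
toy_residues = {
    ("A", 10): [(0.0, 0.0, 0.0)],
    ("A", 11): [(3.0, 0.0, 0.0)],
    ("A", 50): [(20.0, 0.0, 0.0)],
    ("B", 7): [(0.0, 6.0, 0.0)],
}

neighbours = close_pairs(toy_residues, NEIGHBOUR_CUTOFF)
interface = close_pairs(toy_residues, INTERFACE_CUTOFF, across_chains=True)
print(neighbours)
print(interface)
```

In this toy structure, residues 10 and 11 of chain A are neighbours, while residues 10 and 11 of chain A both fall within the interface cutoff of residue 7 in chain B.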
Pairwise (Smith-Waterman) alignment is used to reconcile inconsistencies between the protein sequences in PDBs, UniProt and Ensembl (`Ensembl Protein ID`).
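As a rough illustration of how a local alignment yields a position correspondence between two versions of a protein sequence, here is a minimal Smith-Waterman sketch. The scoring values (match 2, mismatch -1, linear gap -2) and the function name are assumptions made for the example; the workflow's actual scoring scheme and implementation are not described in this document.

```python
def smith_waterman_map(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Locally align two sequences and return {position_in_a: position_in_b}.

    Positions are 1-based. Only residues paired in the best local alignment
    appear in the result; residues opposite a gap are left out.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)

    def substitution(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the dynamic programming matrix, remembering the best-scoring cell.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best cell until the local alignment ends (score 0).
    mapping = {}
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            mapping[i] = j
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return mapping


# Invented sequences: a full-length protein and a fragment as it might
# appear in a structure file.
full_sequence = "MKTAYIAKQR"
structure_fragment = "TAYIAKQ"
position_map = smith_waterman_map(full_sequence, structure_fragment)
print(position_map[5])  # position 5 of the full sequence maps to 3 in the fragment
```

With a map like this, a mutation position given relative to one sequence can be translated to the corresponding position in the other, and positions absent from the map can be reported as not covered.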
Reference:
Vazquez M, Valencia A, Pons T. (2015) Structure-PPi: a module for the annotation of cancer-related single-nucleotide variants at protein-protein interfaces. Bioinformatics 31(14):2397-2399 (doi: 10.1093/bioinformatics/btv142)
Wizard
Use the following textbox to input your mutations and retrieve all
annotations, including neighbours and interfaces. This method is limited
to 1000 variants; use the other (more granular) tasks if your mutation set
is larger. Mutations can be specified as a genomic mutation
(`18:6237978:G`), a mutated isoform (`ENSP00000382976:L257R`), or using any
identifier instead of the `Ensembl Protein ID`, such as an
`Associated Gene Name` or gene symbol (`KRAS:G12V`).
If genomic mutations are given, only principal isoforms are considered. If
the protein is specified with any identifier other than
`Ensembl Protein ID`, it will be translated to `Ensembl Gene ID` and then
its principal isoform will be extracted from Appris. For instance, if the
mutation is given using `UniProt/SwissProt Accession`, and the change is
relative to the sequence reported in UniProt, inconsistencies may appear
from wrong isoform mappings or due to discrepancies in the sequence. No
attempt is made to fix such inconsistencies in this wizard.
The organism is assumed to be `Hsa/feb2014`. If genomic mutations are
introduced, they are assumed to be relative to the Watson or forward
strand.
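To make the accepted input forms concrete, the following sketch classifies a mutation string as a genomic mutation, a mutated isoform, or a protein mutation given with some other identifier. The patterns are inferred only from the three examples in the text (`18:6237978:G`, `ENSP00000382976:L257R`, `KRAS:G12V`); the wizard's real parser may accept more forms, and the function and field names here are hypothetical.

```python
import re

# Assumption: chromosome:position:allele, with allele made of A, C, G, T.
GENOMIC = re.compile(
    r"^(?P<chromosome>[0-9]{1,2}|X|Y|MT?):(?P<position>\d+):(?P<allele>[ACGT]+)$"
)
# Assumption: identifier:<reference residue><position><mutated residue>.
PROTEIN = re.compile(
    r"^(?P<protein>[^:]+):(?P<reference>[A-Z])(?P<residue>\d+)(?P<mutated>[A-Z*])$"
)


def classify_mutation(text):
    """Return a dict describing one mutation string, or raise ValueError."""
    text = text.strip()
    found = GENOMIC.match(text)
    if found:
        return {
            "kind": "genomic_mutation",
            "chromosome": found["chromosome"],
            "position": int(found["position"]),
            "allele": found["allele"],
        }
    found = PROTEIN.match(text)
    if found:
        # Identifiers that are not Ensembl Protein IDs would need translating
        # to a gene and then to its principal isoform, as described above.
        is_ensembl_protein = found["protein"].startswith("ENSP")
        return {
            "kind": "mutated_isoform" if is_ensembl_protein else "other_identifier",
            "protein": found["protein"],
            "reference": found["reference"],
            "residue": int(found["residue"]),
            "mutated": found["mutated"],
        }
    raise ValueError(f"Unrecognized mutation format: {text!r}")


for example in ["18:6237978:G", "ENSP00000382976:L257R", "KRAS:G12V"]:
    print(example, "->", classify_mutation(example)["kind"])
```

Splitting out the `other_identifier` case mirrors the caveat above: those entries go through an identifier translation step in which isoform mismatches can arise.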
Scores
While Structure-PPi itself is not intended to be a stand-alone damage
predictor, we provide a score, the `Structure-PPi feature score`, that
quantifies the protein features that overlap or are close to each
mutation. The score is built by adding individual scores for the
different features. The individual score that each feature contributes
has been selected based on expert opinion and guided by empirical results on
the `COSMIC` and `1000 Genomes` data. The scoring scheme is as follows:
- Appris features: we add 2 if at least one ligand binding or catalytic site annotated in `firestar` is affected; if none of the affected features meets this condition we add only 1
- COSMIC mutations: 3 if more than ten COSMIC samples have mutations overlapping the residue, 2 if more than five, and 1 if more than one sample. We add nothing if just one sample is found
- UniProt variants: 1 if the position has at least one variant annotated. If at least one of these variants is also annotated as `Disease` we add 2 more. If none is classified as `Disease` but at least one is annotated as `Unclassified` we add 1 more. If all are annotated as `Polymorphism` we add nothing more
- UniProt features: we add 1 if any of the following features are affected: `MUTAGEN, DISULFID, DNA_BIND, METAL, INTRAMEM, CROSSLNK`. These features show a frequency that is more than double in COSMIC with respect to 1000 Genomes. MUTAGEN entries are only considered if the description field does not include the text 'No effect'
- Affected interfaces: we add 2 if any protein-protein interaction surface is affected
These scores are calculated for the direct hits and for the neighbour
hits (with the exception of affected interfaces, where it doesn't apply).
Scores for neighbours are divided by 2. The final tally is reported
under the section `Damage predictions` in the wizard report.
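The scoring scheme above can be sketched as a small function. This is an illustration under stated assumptions, not the workflow's implementation: the input layout (a dict per hit set with the hypothetical keys `appris_features`, `firestar_site`, `cosmic_samples`, `uniprot_variants` and `uniprot_features`) is invented, and neighbour hits are assumed to be pooled into a single set before being scored and halved.

```python
SCORED_UNIPROT_FEATURES = {
    "MUTAGEN", "DISULFID", "DNA_BIND", "METAL", "INTRAMEM", "CROSSLNK",
}


def appris_points(feature_count, firestar_site):
    """2 if a firestar ligand binding or catalytic site is hit, 1 for any other Appris hit."""
    if firestar_site:
        return 2
    return 1 if feature_count > 0 else 0


def cosmic_points(samples):
    """3 for more than ten samples, 2 for more than five, 1 for more than one."""
    if samples > 10:
        return 3
    if samples > 5:
        return 2
    return 1 if samples > 1 else 0


def uniprot_variant_points(variant_types):
    """1 for any variant, plus 2 if any is Disease, else plus 1 if any is Unclassified."""
    if not variant_types:
        return 0
    if "Disease" in variant_types:
        return 3
    return 2 if "Unclassified" in variant_types else 1


def uniprot_feature_points(features):
    """1 if any scored feature is hit; MUTAGEN entries described as 'No effect' are skipped."""
    for feature_type, description in features:
        if feature_type not in SCORED_UNIPROT_FEATURES:
            continue
        if feature_type == "MUTAGEN" and "No effect" in description:
            continue
        return 1
    return 0


def hit_points(hits):
    """Sum the four per-feature scores for one set of hits (direct or neighbour)."""
    return (
        appris_points(hits.get("appris_features", 0), hits.get("firestar_site", False))
        + cosmic_points(hits.get("cosmic_samples", 0))
        + uniprot_variant_points(hits.get("uniprot_variants", []))
        + uniprot_feature_points(hits.get("uniprot_features", []))
    )


def feature_score(direct, neighbours, interface_affected):
    """Direct points, plus half the neighbour points, plus 2 for an affected interface."""
    interface = 2 if interface_affected else 0
    return hit_points(direct) + hit_points(neighbours) / 2 + interface


# Invented example data.
direct_hits = {
    "appris_features": 1,
    "firestar_site": True,
    "cosmic_samples": 12,
    "uniprot_variants": ["Disease", "Polymorphism"],
    "uniprot_features": [("MUTAGEN", "No effect on binding")],
}
neighbour_hits = {
    "cosmic_samples": 6,
    "uniprot_features": [("METAL", "Zinc")],
}
total = feature_score(direct_hits, neighbour_hits, interface_affected=True)
print(total)
```

In this example the direct hits contribute 2 + 3 + 3 + 0 = 8, the neighbour hits contribute (2 + 1) / 2 = 1.5, and the affected interface adds 2, for a total of 11.5.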
Precomputed results
The following files contain reports for all mutations in the COSMIC and 1000 Genomes databases. They were produced using the Structure-PPi and Sequence workflows. Due to the large size of these datasets, we have skipped annotation with the `COSMIC` database itself, which would have resulted in massive result files.
- COSMIC:all - genomic_mutation_annotations/consequence
- COSMIC:all - genomic_mutation_annotations/mutation_genes
- COSMIC:all - genomic_mutation_annotations/mutation_mi_annotations
- COSMIC:all - mutated_isoform_annotations/Appris
- COSMIC:all - mutated_isoform_annotations/InterPro
- COSMIC:all - mutated_isoform_annotations/UniProt
- COSMIC:all - mutated_isoform_annotations/db_NSFP
- COSMIC:all - mutated_isoform_annotations/interfaces
- COSMIC:all - mutated_isoform_annotations/variants
- COSMIC:all - mutated_isoform_neighbour_annotations/Appris
- COSMIC:all - mutated_isoform_neighbour_annotations/InterPro
- COSMIC:all - mutated_isoform_neighbour_annotations/UniProt
- COSMIC:all - mutated_isoform_neighbour_annotations/variants
- Genomes1000:all - genomic_mutation_annotations/consequence
- Genomes1000:all - genomic_mutation_annotations/mutation_genes
- Genomes1000:all - genomic_mutation_annotations/mutation_mi_annotations
- Genomes1000:all - mutated_isoform_annotations/Appris
- Genomes1000:all - mutated_isoform_annotations/InterPro
- Genomes1000:all - mutated_isoform_annotations/UniProt
- Genomes1000:all - mutated_isoform_annotations/db_NSFP
- Genomes1000:all - mutated_isoform_annotations/interfaces
- Genomes1000:all - mutated_isoform_annotations/variants
- Genomes1000:all - mutated_isoform_neighbour_annotations/Appris
- Genomes1000:all - mutated_isoform_neighbour_annotations/InterPro
- Genomes1000:all - mutated_isoform_neighbour_annotations/UniProt
- Genomes1000:all - mutated_isoform_neighbour_annotations/variants
Tasks
- annotate: Annotates genomic mutations based on the protein features that overlap amino-acid changes
- annotate_mi: Annotates mutated isoforms based on the protein features that overlap amino-acid changes
- annotate_mi_neighbours: Annotates mutated isoforms based on the protein features that are in close physical proximity to amino-acid changes
- annotate_neighbours: Annotates genomic mutations based on the protein features that are in close physical proximity to amino-acid changes
- interfaces: Finds variants that affect residues in protein-protein interaction surfaces
- mi_interfaces: Finds mutated isoforms with affected residues in protein-protein interaction surfaces
- mi_neighbours: Finds residues in close physical proximity to amino-acid changes in mutated isoforms
- neighbour_map: For a given PDB, finds all pairs of residues that fall within a given 'distance' of each other. It uses PDBs from Interactome3d for individual proteins.
- neighbours_in_pdb: Uses a PDB to find the residues neighbouring, in three-dimensional space, a particular residue in a given sequence.
- pdb_alignment_map: Finds the correspondence between sequence positions in a PDB and in a given sequence. PDB positions are reported as `chain:position`.
- pdb_chain_position_in_sequence: Translates the positions of amino acids in a particular chain of the provided PDB into positions inside a given sequence.
- score_summary: Produces a small table summarizing the mutation scores and a few of the features
- scores: Scores a list of variants based on the report generated by the `wizard`. The limitation to 1000 variants still holds.
- sequence_position_in_pdb: Translates the positions inside a given amino-acid sequence to positions in the sequence of a PDB by aligning them
- wizard: Runs a list of variants through all the analyses and produces a combined report. This analysis is limited to 1000 variants (use the other, more granular, tasks otherwise). Variants can be expressed as genomic mutations or protein mutations. When protein mutations are used, the name of the protein can be an `Ensembl Protein ID` or any other protein or gene identifier, including gene symbols (e.g. `KRAS:G12V`)