Thursday, 20 March 2014

Lampreys and Hagfishes are now reunited in a monophyletic clade (Cyclostomata) in the NCBI Taxonomy

The NCBI Taxonomy is a powerful resource and provides many tools to search for relationships between organisms:

However, according to their disclaimer, they do not claim to be an "authoritative source for nomenclature or classification". Many nodes are unresolved, and some nodes do not reflect recent changes. For example, the paraphyly of Hyperotreti (hagfishes) and Hyperoartia (lampreys):

In this context, hagfishes are not considered proper vertebrates, only craniates. However, they are vertebrates too, and hagfishes and lampreys should be grouped in a monophyletic clade called Cyclostomata ("round mouth").

We can suggest appropriate changes to the NCBI team. For example, the introduction of the Dipnotetrapodomorpha clade, which groups Dipnoi and Tetrapoda and leaves the coelacanth clade as an outgroup:

I sent the following email to NCBI (extracts below), and they kindly replied and followed my suggestions:

I am writing to request a few changes in the NCBI Taxonomy, in order to reflect recent findings in the phylogeny of the basal Vertebrates.

The current NCBI taxonomy at the basis of Chordata (7711) assumes a paraphyletic relationship of Sea Lamprey (Hyperoartia, 117569) and Hagfishes (Hyperotreti, 117565):

However, recent publications strongly support the monophyly of Hyperoartia and Hyperotreti, in a clade named “Cyclostomata”. This clade is the sister clade of the jawed vertebrates, “Gnathostomata”:
Nature. 2014 Feb 12. doi: 10.1038/nature12980.
A primitive placoderm sheds light on the origin of the jawed vertebrate face.
Dupret V, Sanchez S, Goujet D, Tafforeau P, Ahlberg PE.
Development. 2012 Jun;139(12):2091-9. doi: 10.1242/dev.074716.
Evolutionary crossroads in developmental biology: cyclostomes (lamprey and hagfish).
Shimeld SM, Donoghue PC.
Proc Biol Sci. 2011 Apr 22;278(1709):1150-7. doi: 10.1098/rspb.2010.1641.
Decay of vertebrate characters in hagfish and lamprey (Cyclostomata) and the implications for the vertebrate fossil record.
Sansom RS, Gabbott SE, Purnell MA.
Proc Natl Acad Sci U S A. 2010 Nov 9;107(45):19137-8. doi: 10.1073/pnas.1014583107.
microRNAs revive old views about jawless vertebrate divergence and evolution.
Janvier P.
Proc Natl Acad Sci U S A. 2010 Nov 9;107(45):19379-83. doi: 10.1073/pnas.1010350107.
microRNAs reveal the interrelationships of hagfish, lampreys, and gnathostomes and the nature of the ancestral vertebrate.
Heimberg AM, Cowper-Sal-lari R, Sémon M, Donoghue PC, Peterson KJ.
Mol Biol Evol. 2009 Jan;26(1):47-59. doi: 10.1093/molbev/msn222.
Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after?
Kuraku S, Meyer A, Kuratani S.

I propose the following changes in the NCBI taxonomy:

- The “Craniata” clade (89593) should be replaced by the “Vertebrata” clade (7742). Craniata could eventually be used as a synonym of Vertebrata. 
- A “Cyclostomata” clade should be created within the “Vertebrata” clade (7742). This “Cyclostomata” clade would be a sister group of the “Gnathostomata” clade (7776).

- The “Hyperotreti” clade (117565) should be moved into the new “Cyclostomata” clade.

- Similarly, the “Hyperoartia” clade (117569) should be moved into the new “Cyclostomata” clade.

The new taxonomy would look like:

- Vertebrata (vertebrates)
--- Cyclostomata (living agnathans)
------ Hyperotreti (fish)
------------ Myxiniformes
------ Hyperoartia (fish)
------------ Petromyzontiformes
--- Gnathostomata (jawed vertebrates)
------ Chondrichthyes (cartilaginous fishes)
------ Teleostomi

We are using the NCBI taxonomy to automate the clustering of genes based on the tree of life. It would be very useful for us if the NCBI taxonomy could reflect the recent support for the monophyly of Hyperoartia and Hyperotreti in the Cyclostomata clade.

And here is the result a few days later:

Nice, isn't it?

And the information has also been transferred to the dump files:

The old one:
grep "Cyclostomata" names.dmp
97265    |    Cyclostomata                        |                |    synonym        |
97265    |    Cyclostomata Busk , 1852    |                |    authority        |
The new one:
grep "Cyclostomata" names.dmp
97265    |    Cyclostomata                       |    Cyclostomata <bryozoan>    |    synonym        |
97265    |    Cyclostomata Busk , 1852    |                |    authority    |
1476529    |    Cyclostomata              |    Cyclostomata <chordate>    |    scientific name    |
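
The same lookup can be done programmatically. The following is a minimal sketch, assuming the usual names.dmp layout of four fields (tax_id, name, unique name, name class) separated by "|" and padded with tabs. The function names are hypothetical, and the three records are the ones from the new dump shown above.

```python
def parse_names_line(line):
    """Split one names.dmp record into its four fields."""
    # Fields are separated by "|"; the surrounding tabs/spaces are padding.
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    tax_id, name, unique_name, name_class = fields[:4]
    return tax_id, name, unique_name, name_class


def find_name(lines, query):
    """Return every record whose name field is exactly `query`."""
    records = [parse_names_line(line) for line in lines if line.strip()]
    return [record for record in records if record[1] == query]


# The three records of the new dump, written with explicit tabs.
new_dump = [
    "97265\t|\tCyclostomata\t|\tCyclostomata <bryozoan>\t|\tsynonym\t|",
    "97265\t|\tCyclostomata Busk , 1852\t|\t\t|\tauthority\t|",
    "1476529\t|\tCyclostomata\t|\tCyclostomata <chordate>\t|\tscientific name\t|",
]

for tax_id, name, unique_name, name_class in find_name(new_dump, "Cyclostomata"):
    print(tax_id, name_class, unique_name)
```

Unlike grep, this matches the name field exactly, so the "authority" record is skipped and only the bryozoan synonym and the new chordate scientific name are reported.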

If systematists and other biologists around the world join forces and contribute to this resource, it will become a very accurate and powerful tool to assist molecular evolution analyses.

Friday, 20 December 2013

SMBE 2014 - Symposia with a particular focus on protein evolution

The list of Symposia for the SMBE (Society for Molecular Biology & Evolution) conference has been published:

I selected some that have a particular focus on protein evolution (so a direct link with this blog):

2. Biochemistry meets molecular evolution
Molecular evolution research is often focused on sequence analysis, treating genes and genomes as simple strings composed of letters A, C, G, and T. Yet these sequences represent real, three-dimensional molecules with complex structure and function. Many of the most fundamental breakthroughs in our understanding of molecular evolution have come from extracting core findings from biochemistry and molecular biology and incorporating them into evolutionary models and techniques. For example, the structure of the genetic code gave us dN/dS tests. In more recent years, knowledge about RNA secondary structures, about nucleosome positioning signals, and about protein folding free energies have all contributed to our understanding of molecular evolution.
This symposium will bring together researchers working at the interface of biochemistry and molecular evolution, and contribute to an extended evolutionary synthesis that brings biochemistry and molecular biology into the core of evolutionary thought. The invited speakers are biochemists who would not normally attend SMBE, whose talks will complement submitted talks from the evolutionary biology community. Our hope is that the symposium will be a showcase for the emerging synthesis, highlighting the diversity of biochemical facts that are relevant to molecular evolution, and encouraging more molecular evolutionists to incorporate biochemical thinking into their work.
Organizers: Joanna Masel, Claus Wilke

7. Everything That Rises Must Converge
Although evolution mostly proceeds by accumulation of differences between groups, convergent evolution can occur when similar solutions are found to common evolutionary problems. Well known morphological examples include eyes and wings, but examples are increasingly being found at the molecular level, including proteins involved in echolocation in bats and cetaceans, foregut fermentation proteins in monkeys and cows, transcription factors in mammals and birds, and mitochondrial proteins in snakes and agamid lizards. These examples suggest that convergent molecular evolution may be more common than previously thought. Undetected convergence can provide strong support for incorrect phylogenies, while evidence for convergence can inform on the adaptive landscape, the constraints acting on evolutionary processes, and the role of chance and necessity in evolution.
Our symposium will discuss specific examples and provide theoretical insights addressing: 1) How can adaptive convergence be identified and distinguished from other evolutionary processes? 2) When is convergence a problem for phylogenetic analysis? 3) What information does convergence provide about selective pressures and the process of adaptation? The diverse ramifications of convergent evolution will be interesting to those who want to understand basic principles of evolution as well as those who want more accurate phylogenetic trees.
Organizers: David Pollock, Richard Goldstein

8. Evolution of Protein Superfamilies: Origins, Structure and Function (Combined)
Reconstructing and interpreting the phylogenetic history of protein superfamilies (i.e., families that include paralogous genes) pose unique challenges. Phylogenetic accuracy is particularly difficult to achieve in the face of the extreme functional and structural variability observed in protein superfamilies spanning many paralogous clades. Post-hoc interpretation of phylogenies offers valuable insights into protein structure and function, but best practices have yet to be established for tasks such as predicting gene origins relative to key transitions in species evolution and integrating this knowledge with other attributes of genes, phylogenomic ortholog identification and elucidating the co-evolution of sequence, structure and function.
We welcome abstracts that tackle reconstructing and interpreting protein superfamily trees, including large-scale gene origin and ortholog analysis, novel benchmarking and simulation studies, domain-based phylogenies, reconstructing ancestral proteins, and other issues involved in protein superfamily analysis.
Organizers: Tony Capra, Dannie Durand, Toni Gabaldon, Christine Orengo, Kimmen Sjolander, Maureen Stolzer.

26. Mutation: The Ultimate Source of Molecular Variation
As biologists, we are interested in explaining variation at multiple levels: what makes species different, what makes individuals different, what makes genes different, etc. One cannot study such variation without understanding mutation. Mutations can result from errors during replication, mistakes during recombination, or from environmental factors. Although there are mutation rates estimated for taxa, mutation rates vary across the genome, and may vary with age. While point-mutations have been well studied, more complex events like indels or segmental-duplications have not.
Mutations are of great interest for studying population history and mechanisms of evolution. Mutations are also widely sought after as the causes of disease and phenotypic variation. However, mutations do not occur in a vacuum; genetic background influences mutations’ impact on fitness. In this symposium, we bring together researchers who study the origin and impact of mutations, including the influence of mutation patterns on evolution and disease.
Organizers: Reed A. Cartwright, Melissa A. Wilson Sayres

35. The Origin and Evolution of Early Life
Major transitions during the early history of life pose several challenges, from the origin of genetic systems and protein translation, to the core components of all cells and the last universal common ancestor (LUCA), followed by the radiation of the major cellular lineages. This symposium will explore key events during early evolution and the methodological advances that are bringing new insights to this important, but still poorly understood, period of evolutionary history.
The topics to be considered will include the evolution and optimization of the genetic code and the molecules that implement it; the nature, genome and metabolic potential of LUCA; the relationships among the major cellular lineages, and the evolutionary processes underlying their diversification. Recent developments in phylogenetic modeling, ancestral sequence reconstruction, gene tree/species tree reconciliation, and phylogenetic networks all promise to shed new light on early evolution, and this symposium will welcome these and other approaches to these fascinating and enduring problems.
Organizers: Tom Williams, Martin Embley, Steven E. Massey, Aaron Goldman.

36. The role of epistasis in molecular evolution
Epistasis (nonadditive interactions between mutations) can influence the rate and direction of evolutionary change, and is therefore of longstanding interest to evolutionary geneticists. In molecular evolution, insights into the form and prevalence of epistasis are relevant to fundamental questions about the topography of adaptive landscapes and the predictability of mutational pathways through sequence space. In recent years, microbial experimental evolution studies have demonstrated how epistasis shapes the structure of the genotype-fitness map, and directed-mutagenesis studies have documented epistasis between mutant sites in the same protein, revealing the direct causes of genetic constraints on adaptation and shedding light on the selective accessibility of alternative mutational paths to high-fitness genotypes.
Within the past couple of years, a number of high-profile papers have opened up fresh debates about the role of epistasis in molecular evolution (e.g., Breen et al. [2012] Nature 490: 535-538; McCandlish et al. [2013] Nature 497:E1-E2). It is clear that a symposium on this topic would be very timely. The purpose of the proposed symposium is to showcase recent theoretical and empirical advances in our understanding of epistasis and its influence on evolutionary mechanism and process. Our aim is to showcase work that tackles big questions and motivates new research directions.
Organizers: Jay F. Storz, Kristi L. Montooth

Wednesday, 20 November 2013

How evolutionary biologists can help biologists from other fields (especially the medical side).

In a recent post, Dan Graur made various criticisms of a paper published last year in PLoS ONE:

and here is the link to the paper:

I like the concept of this paper, which combines various bioinformatics methods to identify evolutionary pressures on sites and tries to correlate them with disease mutations (see also our review on the subject and this more recent one).

I have other comments (more like recommendations) that I would like to share in this blog:

- They used alignments from ClustalW, which is surprising as it is the oldest of these methods. Better methods have been published since then, such as Muscle, Mafft, Probcons, T-coffee and Clustal-Omega. The authors compared some of them, but found that ClustalW was the best to use in their study. They could also have tried some phylogeny-aware alignment methods (PRANK, PAGAN), to see whether it makes a difference.

Recommendation number 1: try and/or use different alignment methods that have been proven to be robust.

- There is no indication of how they built the tree (NJ? ML? Bayes?). I tried to produce a tree myself, but I got the same result, with the Rat/Mouse clade also badly placed. These two sequences are highly divergent compared to the other sequences, so it is no surprise that even sophisticated algorithms produce strange results.

Recommendation number 2: while Minimum Evolution (FastME) and Neighbour-Joining (FastTree) can provide some good results, always try and/or use different methods to build a phylogenetic tree, especially with Maximum Likelihood (PhyML, RAxML) and Bayesian (MrBayes).

Recommendation number 3: always indicate all the details on the methods used: name of the program, release used, parameters used (other than default).

- They could have rooted the tree with the teleost fishes:

This is not wrong, but it would be much better to present it as rooted.

Recommendation number 4: always try to present your results in the light of evolution.

- The methods they used to analyse conservation/functional divergence are sensitive to the quality of the input, but I think they can tolerate a few topological errors. For example, Pupko and Galtier found that their algorithm gives the same result with different trees:
"Several controls were performed to check the robustness of these results. Results were essentially unchanged when a different phylogenetic tree (Reyes et al. 2000) was used (not shown)."
As the algorithms are similar (i.e. Diverge), I would not be surprised to see a similar result with the taxonomic tree. Of course, as a reviewer, I would ask the authors to use the two alternative topologies: taxonomic tree and genetic tree.

- The sheep is badly placed, but considering the short branch, not really a surprise. And again, I don't think this will influence greatly the final result.

- However, my main worry is the power of their analysis, and what they tried to do. For example, they used DIVERGE like this: "DIVERGE site-specific evolutionary constraint values were computed using the depth 4 (vertebrate) data set only. DIVERGE was run by splitting the vertebrate phylogeny at the deepest node separating the fish from the terrestrial vertebrates, and Type II divergence values were recorded." So they are testing whether any sites are under functional divergence between fishes (3 species) and tetrapods (15 species).

There are three problems here:
1) They only analysed the deepest evolutionary event. If any functional divergence happened later (e.g. in mammals), it could not be detected.
2) They used a clearly unbalanced dataset (3 versus 15).
3) With only 3 species on one side, they clearly have no power and no accuracy. The pattern of conservation (or divergence) could just be noise. Normally, at least four species are needed in each clade to see a significant pattern.

Recommendation number 5: always try to have a sufficient amount of data in order to have enough power. We are in the genomic era, so a wealth of sequences is available. For example, thousands of sequences give enough power to predict 3D structures.

Recommendation number 6, and the most important: always try to understand what you are doing, and especially check whether the methods you are using are appropriate for your data. Most evolutionary bioinformatics tools are not straightforward, and some are quite complex. For example, I spent a significant part of my PhD understanding and implementing CodeML/PAML. While I am quite confident with it, I think I still have things to learn in this field.

=> If you are not familiar with evolutionary bioinformatics tools, ask a colleague or an expert. They will be more than happy to help.

Wednesday, 24 July 2013

False discovery rate correction for multiple testing of positive selection [Practical]

When testing for positive selection (e.g. with CodeML/PAML) on multiple branches one by one, we need to correct the results for multiple testing in order to control the false discovery rate. There are many methods to do this (see the wiki page):

One I like is the QVALUE package, for R:

Launch R and install QVALUE by typing:
and load it:
We need to load the list of p-values into R. This is a single-column file containing only p-values. The order is really important, so we keep the ordered names of the branches and/or genes tested in a separate file.

A file containing p-values from 760 gene families is provided here as an example:

Once we have our file of p-values, we can load it into R:
p <- scan("pvalues.list", na.strings=T)

First, have a look at the distribution:

hist(p, breaks = 20, main = paste("Distribution of p-values"), xlab="Value")

which will display this histogram:
The distribution is bimodal, as expected: when the test is negative, we will have a p-value = 1, and when the test is significant, the p-value is generally very low.
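
To see where a p-value of exactly 1 can come from, here is a minimal sketch in Python. It assumes that each p-value comes from a likelihood-ratio test between two nested models, compared to a chi-square distribution with one degree of freedom, as is commonly done for the branch-site test. The function name and the log-likelihood values are made up for illustration.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """P-value of a likelihood-ratio test with one degree of freedom."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0.0:
        # The alternative model fits no better than the null model.
        return 1.0
    # Survival function of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(statistic / 2.0))


print(lrt_pvalue(-1000.0, -1000.0))  # no improvement in likelihood
print(lrt_pvalue(-1000.0, -998.0))   # test statistic of 4
```

When the alternative model does not improve the likelihood, the statistic is zero and the p-value is 1, which would produce the peak at 1 in the histogram.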

Now, we can perform the qvalue correction:
qobj <- qvalue(p, pi0.meth="bootstrap", fdr.level=0.05)

Two comments:
  1. The option “bootstrap” is really important, as the distribution is bimodal (see page 11 of the QVALUE manual).
  2. fdr.level=0.05 sets the FDR threshold: among the tests called significant, about 5% are expected to be false positives.

Finally, we write the output into a file:
qwrite(qobj, filename="qvalues.list")

The output file has three columns:
  1. The p-values, in the same order as in the input file.
  2. The corresponding q-value.
  3. The significance of the test, based on the threshold (0 or 1).

     pi0: 0.307017543859649

    FDR level: 0.05

    p-value q-value significant
    0.006233598 0.005675876 1
    0.1485256 0.06931197 0
    1 0.3070175 0
    0.06828507 0.03662801 1
    0.03221253 0.02015083 1
Now we can say which test is significant and which is not.
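
To make the relationship between p-values, pi0 and q-values more concrete, here is a minimal sketch in Python. It is not the QVALUE implementation: it assumes that pi0 has already been estimated (the bootstrap step is not reproduced) and applies the standard step-up formula, q(i) = min over j >= i of pi0 * m * p(j) / j, on the sorted p-values. The function name and the toy p-values are hypothetical.

```python
def qvalues_from_pi0(pvalues, pi0):
    """Convert p-values to q-values, given an estimate of pi0."""
    m = len(pvalues)
    # Indices of the p-values, from the smallest to the largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that q-values
    # never increase when p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        candidate = pi0 * m * pvalues[index] / rank
        running_min = min(running_min, candidate)
        qvalues[index] = running_min
    return qvalues


toy_pvalues = [0.01, 0.04, 0.03, 1.0]
toy_qvalues = qvalues_from_pi0(toy_pvalues, pi0=0.5)
for p, q in zip(toy_pvalues, toy_qvalues):
    # The last column mimics the "significant" flag at fdr.level=0.05.
    print(p, round(q, 4), int(q <= 0.05))
```

With this formula, a p-value of 1 always gets a q-value equal to pi0, which agrees with the third row of the output above (q-value of 0.3070175 for pi0 = 0.307).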

Please don't hesitate to contact me if you have any comments or questions.

Monday, 24 June 2013

Browsing the NCBI Taxonomy with Python

The Taxonomy browser is a fantastic tool to browse the Tree of Life:

I wrote a simple script in Python to browse it.

First, download the taxonomy archive in your folder:

Unpack it:
tar xvfz taxdump.tar.gz

It will output the following files:

Only "names.dmp" and "nodes.dmp" are important for us.

Download the script from here and save it in the same folder:

Or copy it and save it as "" in the same folder:


import os
import sys

# Definition of the class Node
class Node:
    def __init__(self):
        self.tax_id = 0       # Number of the tax id.
        self.parent = 0       # Number of the parent of this node
        self.children = []    # List of the children of this node
        self.tip = 0          # Tip=1 if it's a terminal node, 0 if not.
        self.name = ""        # Name of the node: taxon if it's a terminal node, number if not.
    def genealogy(self):      # Trace genealogy from leaf back to root
        ancestors = []        # Initialise the list of all nodes from leaf to root.
        tax_id = self.tax_id  # Define leaf
        while 1:
            if tax_id in name_object:
                ancestors.append(tax_id)             # Record this node...
                tax_id = name_object[tax_id].parent  # ... and move up to its parent
            else:
                break                                # Unknown tax_id: stop here
            if tax_id == "1":
                # If it is the root, we reached the end.
                # Add it to the list and break the loop
                ancestors.append(tax_id)
                break
        return ancestors # Return the list

# Function to find common ancestor between two nodes or more
def common_ancestor(node_list):
    global name_object
    list1 = name_object[node_list[0]].genealogy()  # Define the whole genealogy of the first node
    for node in node_list:
        list2 = name_object[node].genealogy()      # Define the whole genealogy of the current node
        ancestral_list = []
        for i in list1:
            if i in list2:                         # Identify common nodes between the two genealogies
                ancestral_list.append(i)
        list1 = ancestral_list                     # Reassign ancestral_list to list1.
    ancestor = ancestral_list[0]                   # Finally, the first node of ancestral_list is the common ancestor of all nodes.
    return ancestor                                # Return the tax_id of this node

#                           #
#   Read taxonomy files     #
#                           #

# Load names defintion

name_dict = {}          # Initialise dictionary with TAX_ID:NAME
name_dict_reverse = {}  # Initialise dictionary with NAME:TAX_ID

# Load NCBI names file ("names.dmp")
name_file = open("names.dmp","r")
while 1:
    line = name_file.readline()
    if line == "":
        break                             # End of file
    line = line.rstrip()
    line = line.replace("\t","")
    tab = line.split("|")
    if tab[3] == "scientific name":
        tax_id, name = tab[0], tab[1]     # Assign tax_id and name ...
        name_dict[tax_id] = name          # ... and load them
        name_dict_reverse[name] = tax_id  # ... into dictionaries
name_file.close()

# Load taxonomy

# Define taxonomy variable
global name_object
name_object = {}

# Load taxonomy NCBI file ("nodes.dmp")
taxonomy_file = open("nodes.dmp","r")
while 1:
    line = taxonomy_file.readline()
    if line == "":
        break                                    # End of file
    line = line.replace("\t","")
    tab = line.split("|")
    tax_id = str(tab[0])
    tax_id_parent = str(tab[1])
    division = str(tab[4])

    # Define name of the taxid
    name = "unknown"
    if tax_id in name_dict:
        name = name_dict[tax_id]
    if tax_id not in name_object:
        name_object[tax_id] = Node()
    name_object[tax_id].tax_id   = tax_id        # Assign tax_id
    name_object[tax_id].parent   = tax_id_parent # Assign tax_id parent
    name_object[tax_id].name     = name          # Assign name
    if tax_id_parent not in name_object:
        name_object[tax_id_parent] = Node()      # Parent not read yet: create a placeholder
    if tax_id_parent != tax_id:                  # The root is its own parent: skip it
        name_object[tax_id_parent].children.append(tax_id)  # Register this node as a child of its parent
taxonomy_file.close()

#                                               #
#    Example 1 : Evolutionary history of human  #
#                                               #

# But what is the tax_id of Human ???
tax_id_human = name_dict_reverse["Homo sapiens"]

print("Tax_id of Human (Homo sapiens) is: " + tax_id_human)

# Ok, now define human genealogy...
human_genealogy = name_object[tax_id_human].genealogy()

#... and display it, with tax_id and name
for tax_id in human_genealogy:
    print("Name: " + name_object[tax_id].name + "  Tax_id: " + tax_id)

#                                                                #
#    Example 2 : Common ancestor  between Trichoplax and human   #
#                                                                #

# What is the common ancestor between Trichoplax and Human?

# Trichoplax:

# Define the two nodes and add them to a list
tax_id_1 = name_dict_reverse["Trichoplax adhaerens"]
tax_id_2 = name_dict_reverse["Homo sapiens"]
list_of_nodes = [tax_id_1, tax_id_2]

# Identify the common ancestor
ancestor_id = common_ancestor(list_of_nodes)  # Use a new name so the function is not overwritten

print("The common ancestor between " + name_object[tax_id_1].name + " and " + name_object[tax_id_2].name + " is: " + name_object[ancestor_id].name)

Then make it executable:
chmod +x

Then execute it:

And enjoy!

It will do four things:
1) Load the association between Tax_id and Names ("names.dmp")
2) Load the Tree of Life ("nodes.dmp")
3) Display the genealogy from Human back to the root
4) Find the common ancestor between Trichoplax and Human.
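
To try the common-ancestor logic without downloading the dump files, here is a minimal self-contained sketch. The child-to-parent map is a toy, heavily pruned lineage written by hand for illustration (names are used instead of tax_ids), and the function names simply mirror those of the script.

```python
# Toy child -> parent map: a hand-written, heavily pruned lineage.
parent = {
    "Homo sapiens": "Bilateria",
    "Bilateria": "Eumetazoa",
    "Eumetazoa": "Metazoa",
    "Trichoplax adhaerens": "Placozoa",
    "Placozoa": "Metazoa",
    "Metazoa": "root",
    "root": "root",  # the root is its own parent
}


def genealogy(node):
    """List the nodes from `node` up to the root (leaf first)."""
    path = [node]
    while parent[node] != node:
        node = parent[node]
        path.append(node)
    return path


def common_ancestor(nodes):
    """Most recent node shared by the genealogies of all `nodes`."""
    shared = genealogy(nodes[0])
    for node in nodes[1:]:
        lineage = set(genealogy(node))
        # Keep only the ancestors also found in this lineage, order preserved.
        shared = [n for n in shared if n in lineage]
    return shared[0]


print(genealogy("Trichoplax adhaerens"))
print(common_ancestor(["Homo sapiens", "Trichoplax adhaerens"]))
```

Because each genealogy is listed leaf first, the first shared node is the most recent common ancestor, which is the same reasoning as in the script above.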

Monday, 10 September 2012

ECCB'12: Tutorial on Protein evolution

During the ECCB'12 conference in Basel, Switzerland, I had the opportunity to organise and present a tutorial entitled "Protein Evolution: From Sequence to Structure to Function", along with Christine Orengo (UCL) and Nicholas Furnham (EMBL-EBI). This tutorial attracted 44 participants.

In the morning session, the participants were introduced to various software tools to identify functional shifts in multiple alignments of protein sequences. These tools included BADASP, TDG09, FunDi and Diverge2. Jalview and PyMol were used to visualise the results.
They also used CodeML from the PAML package to infer positive selection during the evolution of nucleotide coding sequences. CodeML contains various codon substitution models that estimate different dN/dS ratios. In particular, the branch-site model was described in detail.

The tutorial of this section is available here:

In the afternoon, Christine Orengo presented the fundamental concept of CATH. CATH is a database of domain superfamilies initially identified using structural data and then expanded with predicted domain structures in genome sequences. Structural similarities are identified using the CATHEDRAL algorithm (Redfern, Orengo, 2007) and homologous relationships are then confirmed using sequence-based HMM-HMM approaches (e.g. HHpred, Hildebrand 2009). CATH superfamilies are now also subclassified into functional families (FunFams), and new pages presenting these families and conserved residues identified and projected onto representative structures were demoed during the session.

During the practical, Ian Sillitoe, one of the main CATH computational scientists, presented the new release of the CATH database.

The tutorial of this section is available here:

Finally, Nick Furnham presented FunTree, a new resource, developed in the laboratory of Prof. Janet Thornton at the European Bioinformatics Institute, in collaboration with Prof. Christine Orengo at UCL. It brings together sequence, structure, phylogenetic, chemical and mechanistic information for structurally defined enzyme superfamilies.
During the practical, he showed how the wide variety of data captured in FunTree from CATH/CATH-Gene3D, PDB, UniProtKB, ArchSchema, CSA, MACiE, ChEBI and KEGG is gathered, analysed and displayed.

The tutorial of this section is available here:

Thursday, 1 March 2012

Three-dimensional reconstruction of protein networks provides insight into human genetic disease

Last month, Wang et al. published in Nature Biotechnology an interesting study combining computational networks, protein structure modelling and associated diseases. The concept is a brilliant illustration of how multiple genomics datasets can be combined into one:
  1. They took all information about protein-protein interactions.
  2. They took all interactions between domains (iPfam, 3did).
  3. Using these two datasets, they produced a high quality structurally resolved network (hSIN).
  4. They took genes involved in diseases.
  5. Finally, they mapped the interacting sites and disease mutations onto the corresponding structures.

Different conclusions emerged (or confirmed previous results) from this study:

  • Non-synonymous SNPs in proteins involved in disease are randomly distributed, meaning that most SNPs are not disease-related.
  • Genes can be involved in multiple unrelated diseases (pleiotropic effect). They can have mutations at one interface leading to one particular disease, as well as mutations on the opposite side (see Figure) or in another domain, leading to another disease. One example illustrated is the WASP gene. WASP can interact with CDC42 and VASP. When WASP is mutated (in WH1 domain) to prevent binding with CDC42, it leads to X-linked neutropenia (XLN). When WASP is mutated (in PBD domain) to prevent binding with VASP, it leads to Wiskott-Aldrich syndrome (WAS) and/or X-linked thrombocytopenia (XLT).
  • Using their hSIN dataset, they predicted around 300 candidate genes for 700 unknown disease-to-gene associations.

Pleiotropic effect. Mutation 1 in protein A will lead to disease 1 by blocking binding to protein B. Mutation 2 in protein A will lead to disease 2 by blocking binding to protein C. (Figure inspired from original publication)
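
The mapping idea behind this figure can be sketched in a few lines of Python. This is only an illustration and not the method of the paper: the interface residue ranges, the mutation positions and the function name are all hypothetical.

```python
# Hypothetical interface residues of protein A, keyed by binding partner.
interfaces = {
    "protein B": set(range(40, 61)),
    "protein C": set(range(230, 251)),
}

# Hypothetical disease mutations in protein A: (residue position, disease).
mutations = [(52, "disease 1"), (241, "disease 2"), (150, "disease 3")]


def map_mutations(interfaces, mutations):
    """For each mutation, list the partners whose interface contains it."""
    mapping = {}
    for position, disease in mutations:
        partners = sorted(
            partner
            for partner, residues in interfaces.items()
            if position in residues
        )
        mapping[(position, disease)] = partners
    return mapping


for (position, disease), partners in map_mutations(interfaces, mutations).items():
    print(position, disease, partners if partners else "no interface")
```

Mutations of the same protein that fall in different interfaces end up associated with different partners, which is the pleiotropic pattern described above.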


Wang X, Wei X, Thijssen B, Das J, Lipkin SM, Yu H.
Nat Biotechnol. 2012 Jan 15;30(2):159-64. doi: 10.1038/nbt.2106.