Wednesday, 25 March 2015

Tutorial: estimating the stability effect of a mutation with FoldX


Here is a brief tutorial on how to use FoldX to estimate the stability effect of a mutation in a 3D structure. The stability (ΔG) of a protein is defined by the free energy, which is express in kcal/mol. The lower it is, the more stable it is. G is difference of free energy between a wild-type and mutant. A mutation that brings energy (ΔΔG>kcal/mol) will destabilise the structure, while a mutation that remove energy (ΔΔG<kcal/mol) will stabilise the structure. A common threshold is to say that a mutation has a significant effect if ΔΔG is >1 kcal/mol, which roughly corresponds to a single hydrogen bond.

A way to compute the free energy is to use molecular dynamics. Main problem: it can be very time-consuming.

FoldX uses an empirical method to estimate the stability effect of a mutation. The executable is available here:

NB: I strongly encourage to read the manual (before or in parallel of this tutorial).

Foldx was used in many studies, i.e.:
Tokuriki N, Stricher F, Serrano L, Tawfik DS. How protein stability and new functions trade off. PLoS Comput Biol. 2008 Feb 29;4(2):e1000002

Dasmeh P, Serohijos AW, Kepp KP, Shakhnovich EI. Positively selected sites in cetacean myoglobins contribute to protein stability. PLoS Comput Biol. 2013;9(3):e1002929.
And I personally used it in three of my studies:
Studer RA, Christin PA, Williams MA, Orengo CA. Stability-activity tradeoffs constrain the adaptive evolution of RubisCO. Proc Natl Acad Sci U S A. 2014 Feb 11;111(6):2223-8.
Studer RA, Opperdoes FR, Nicolaes GA, Mulder AB, Mulder R. Understanding the functional difference between growth arrest-specific protein 6 and protein S: an evolutionary approach. Open Biol. 2014 Oct;4(10). pii: 140121.

Rallapalli PM, Orengo CA, Studer RA, Perkins SJ. Positive selection during the evolution of the blood coagulation factors in the context of their disease-causing mutations. Mol Biol Evol. 2014 Nov;31(11):3040-56.


The structure is a bacterial cytochrome P450 (PDB:4TVF). You can download it PDB file (4TVF.pdb) from here:

We would to test the stability of mutatinf the leucine (L) at position 280 to an aspartic acid (D).

Here is the original structure, with Leu280 in green, and residues around 6Å in yellow:

FoldX has different modes to run it, but I use the mode "runfile", which contains COMMANDS and OPTIONS in one single file.

FoldX works in two steps:

1) Repair the structure.

There are frequently problems in PDB structures, like steric clashes. FoldX will try to fix them and lower the global energy (ΔG). The command "RepairPDB" is better than the "Optimize" command. Here is the command file "foldx_repair.txt":
We indicate which PDB file it needs to use, that we want to repair it (<RepairPDB>), that it will use water and metal bonds from the PDB file (<water>-CRYSTAL; <metal>-CRYSTAL;) and that we want a PDB as output (<OutPDB>true).

You have to run the command file "foldx_repair.txt" like this:
foldx3b6 -runfile foldx_repair.txt
This process is quite long (around 10 minutes). Here is the result (the original structure is now in white, while the repaired structure is in yellow/green):

 We can see that some side chains have slightly move (in particular Phe16).

The starting free energy ΔG was 64.99 kcal/mol, and it was lowered to -48.15 kcal/mol, which is now stable (remember that a "+" sign means unstable while a "-" sign means stable).

Once it's finished, it will produce a file named "RepairPDB_4TVF.pdb", which you will use in the next step.

2) Perform the mutation

The mutation itself is perform by the BuildModel function. There are other methods, but the BuildModel is the most robust. You need also to specify the mutation in a separate file "individual_list.txt".

In the command file, you will see that is RepairPDB_4K33.pdb and not 4K33.pdb that is mutated. You will also notice "<numberOfRuns>3;". This is because some residues can have many rotamers and could have some convergence problems. You may to increase this values to 5 or 10, in case you are mutated long residues (i.e. Arginine) that have many rotamers.

Here the command file "foldx_build.txt":
and the "individual_list.txt" (just one line):
It contains the starting amino acid (L), the chain (A), the position (280) and the amino acid you want at the end (D). One line correspond to one mutant. It means you can mutate many residues at the same per line (mutant) and also  produce different mutants by different numbers of lines.

You can run it by:
foldx3b6 -runfile foldx_build.txt
It is much faster this time (i.e. a few seconds) and will produce many files.

FoldX will first mutate the target residue (L) to itself (L) and move it as well as all neighbouring side chains multiple times. We can see that Leu280 (green) was rotated:

=> This is will give the free energy of the wild-type (let's call it ΔGwt).

Then, it will mutate the target residue (L) to the desired mutant (D) and move it as well as all neighbouring side chains multiple times. We can see that Leu280 is mutated to Asp280 (see the two oxygen atoms in red):

=> This is will give the free energy of the mutant (let's call it ΔGmut).

The difference in free energy (ΔΔG) is given by ΔGmut-ΔGwt.

In the file "Raw_BuildModel_RepairPDB_4TVF.fxout", you can retrieve the energy of the three runs for both WT and Mutant.

  • ΔGmut = RepairPDB_4TVF_1_0.pdb = -41.1377 kcal/mol
  • ΔGwt = WT_RepairPDB_4TVF_1_0.pdb = -46.0464 kcal/mol
  • => ΔΔG = ΔGmut-ΔGwt = (-41.1377)-(-46.0464) = +4.9087 kcal/mol

One file contains the average difference over all runs: "Average_BuildModel_RepairPDB_4K33.fxout".
You will notice that the difference in free energy ΔΔG is +4.84 kcal/mol (+- 0.06 kcal/mol).

=> It means the mutation L280D is highly destabilising (positive value, and much above 1.0 kcal/mol). Here is the final mutant:

PS: Another way to define the threshold is to use the SD deviation multiple times:

The reported accuracy of FoldX is 0.46 kcal/mol (i.e., the SD of the difference
between ΔΔGs calculated by FoldX and the experimental values). We can bin the ΔΔG values into seven categories:
  1. highly stabilising (ΔΔG < −1.84 kcal/mol); 
  2. stabilising (−1.84 kcal/mol ≤ ΔΔG < −0.92 kcal/mol); 
  3. slightly stabilising (−0.92 kcal/mol ≤ ΔΔG < −0.46 kcal/mol); 
  4. neutral (−0.46 kcal/mol < ΔΔG ≤ +0.46 kcal/mol);
  5. slightly destabilising (+0.46 kcal/mol < ΔΔG ≤ +0.92 kcal/mol);
  6. destabilising (+0.92 kcal/mol < ΔΔG ≤ +1.84 kcal/mol);
  7. highly destabilising (ΔΔG > +1.84 kcal/mol).

Monday, 15 December 2014

Symposium on Protein Evolution and Structure - ESEB2015 - Lausanne Aug10-14

Dear colleagues,

We are delighted to invite submissions for the ESEB 2015 symposium, in Lausanne, Switzerland.

The symposium will focus on protein evolution in broad terms, including protein conservation and adaptation, detection of positive selection and co-evolution, structural evolution and stability constraints. We welcome submissions integrating studies on protein evolution with biochemistry and functional/structural genomics.

The symposium will take place during the 15th Congress of the European Society for Evolutionary Biology (ESEB) on 10 - 14 August 2015 in Lausanne, Switzerland.

    Deadline: 10 January 2015

    Dan Tawfik, Weizmann Institute of Science, Israel
    Richard Goldstein, University College London, UK

Proteins evolve by the replacement of amino acids (substitutions) or the insertion/deletion of fragments (indels). For the protein, mutations may be deleterious or beneficial, governed by the laws of natural selection. Beneficial mutations increase the fitness of the phenotype and are more likely to become fixed in the genome (positive selection). Proteins are not robust to drastic changes (i.e. important changes in stability) and mutations that favour an adaptive functional change are generally accompanied by other coevolving mutations that insure the integrity of the 3D structure (compensatory effect). All these biophysical properties are paving the way for protein evolution.
Traditionally, there was little if any crosstalk between the fields of protein biophysics, protein structure-function and molecular evolution. The last several years have seen some exciting development in merging these areas to obtain an in-depth understanding of how proteins evolve.
For example, a better understanding of how structural constraints affect protein evolution will greatly help to optimise stochastic models of sequence evolution. The symposium aims at exploring this new synthesis.

Abstracts will be selected for presentation by early March. When submitting your abstract please state your preference (talk, poster) during the submission process.
With questions please contact the symposium organizers:

*    Romain Studer (
*    Maria Anisimova (

We are looking forward to seeing you at ESEB 2015!

Romain and Maria


Romain Studer
European Bioinformatics Institute
Wellcome Trust Genome Campus, UK

Twitter: @RomainStuder

Maria Anisimova
Institute of Applied Simulations
Zurich University of Applied Sciences

Evolutionary Genomics:
Vol 1:
Vol 2:

Thursday, 25 September 2014

Tutorial on Ancestral Sequence Reconstruction

This tutorial was part of a course on protein evolution done during ECCB 2014 in Strasbourg:

NB: If there are any question, comments or bugs, feel free to ask. ;)


The slides of the introduction are available here:

In this practical, you will learn how to prepare files for CodeML, how to use it for reconstructing ancestral sequences and how to compute the isoelectric point of protein sequences. There are a few scripts you will need to download during the practical.

If you need to make these scripts executable, use chmod:
chmod +x

Tools used in this practical:

# Libraries for Python

# Package for ancestral sequence reconstruction (and many other things)
CodeML from PAML:

# Alignment tools
(Clustal-Omega is the new aligner from ClustalW team, but much faster and more accurate)

# Alignment visualisation

# Phylogenetics tools

# Tree visualisation

This practical will focus on the lysozyme, an enzyme (EC that damages bacterial cell walls. The Uniprot page is:
They evolved differently in Primates:







Step 1: Prepare alignment.

The quality of the ancestral reconstruction will heavily depend on the quality of the alignment and the tree topology (branch lengths are re-estimated during the reconstruction).

Please download the sequence file: lysozyme_primates.seq

Make a multiple alignment, either with Mafft-L-INS-i or Clustal-Omega:

mafft-linsi lysozyme_primates.seq > lysozyme_primates.fasta
clustal-omega-1.2.0-macosx --in lysozyme_primates.seq --out lysozyme_primates.fasta

(Please have a look at the alignment in Jalview)

The format of the resulting alignment is FASTA. However, most phylogenetic softwares use PHYLIP format. So, you have to convert it into PHYLIP.
Download the script  "" and execute it: lysozyme_primates.fasta lysozyme_primates.phy

(If you are not familiar, have a look at the differences between the alignment in FASTA and PHYLIP formats).

Step 2: Prepare alignment.

We can now generate a tree, either with PhyML (one of the most accurate tool) or FastTree (very fast and pretty accurate):

phyml -i lysozyme_primates.phy -d aa -m JTT -c 4 -a e -b 0
mv lysozyme_primates.phy_phyml_tree.txt lysozyme_primates.tree

Option used:
-i = input file
-d aa: amino acid sequences
-m JTT: (substitution matrix). JTT works fine for most proteins, but other matrices (WAG, LG) can do slightly better.
-c 4: (numbers of categories for the gamma distribution)
-a e: (estimate alpha parameter for the gamma distribution)
-b 0: (we don't want boostrap, as this will cause trouble for further analyses in CodeML).

or run:

FastTree -nosupport lysozyme_primates.phy > lysozyme_primates.tree

Option used:
-nosupport:(we don't want boostrap, as this will cause trouble for further analyses in CodeML).

Finally, we could root the tree. Use NJplot and root it by the group containing the Marmoset sequence (Callithrix jacchus).

Save it as "lysozyme_primates_rooted.tree"

Step 3: Run ancestral sequence reconstruction.

The ancestral sequence reconstruction is done by CodeML, from the PAML package.

It is launched with "codeml control_file.ctl"

You may have to copy the file "jones.dat" from the dat folder in the PAML package, or indicate its location.

The control file contains many parameters:

      seqfile = lysozyme_primates.phy    * sequence data filename
     treefile = lysozyme_primates_root.tree   * tree structure file name
      outfile = lysozyme_primates.mlc    * main result file name

        noisy = 9  * 0,1,2,3,9: how much rubbish on the screen
      verbose = 2  * 0: concise; 1: detailed, 2: too much
      runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

      seqtype = 2  * 1:codons; 2:AAs; 3:codons-->AAs

        clock = 0  * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
       aaDist = 0  * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
   aaRatefile = ./jones.dat  * only used for aa seqs with model=empirical(_F)

                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own

        model = 2
                   * models for codons:
                       * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                       * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                       * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)

        icode = 0  * 0:universal code; 1:mammalian mt; 2-10:see below
        Mgene = 0     * codon: 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff            
                   * AA: 0:rates, 1:separate

    fix_alpha = 0   * 0: estimate gamma shape parameter; 1: fix it at alpha
        alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 1   * different alphas for genes
        ncatG = 4   * # of categories in dG of NSsites models

        getSE = 0  * 0: don't want them, 1: want S.E.s of estimates

 RateAncestor = 1  * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)

   Small_Diff = .5e-6
    cleandata = 0  * remove sites with ambiguity data (1:yes, 0:no)?
       method = 1  * Optimization method 0: simultaneous; 1: one branch a time

* Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt.,
* 4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt.,
* 7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt.,
* 10: blepharisma nu.
* These codes correspond to transl_table 1 to 11 of GENEBANK.

Explanation of some parameters:
runmode = 0 => We provide the tree.
clock = 0 => We don't set a molecular clock. We assume that the genes are evolving at different rate.
aaDist = 0 => We don't use the physicochemical properties of the amino acid.
aaRatefile = ./jones.dat => We use the JTT matrix. Other matrix could be used (WAG, etc...)
model = 2 => We use an empirical model (= substitutions matrix such as JTT).
fix_alpha = 0 => We estimated the alpha parameter of the gamma distribution.
alpha = 0.5 => We start the estimation from 0.5
RateAncestor = 1 => Force the estimation of ancestral states.
cleandata = 0 => Keep all ambigous data ("-", "X").

Please have a look at the output.

CodeML will also write into many files, but only two are of interest here:

  • lysozyme_primates.mlc => Contains many information on evolutionary rates.
  • rst => Contains ancestral states for sites and for nodes.
Please have a look at both files.

In the rst file, there is also how the tree has been annotated, under the line "tree with node labels for Rod Page's TreeView".
You can copy this tree into a file (i.e. "lysozyme_primates_annotated.tree"), open it with NJplot, and display the "bootstrap values".
We can see at which node correspond which ancestral sequence.

- What is the estimated alpha parameter of the gamma curve?
- How many categories where used? And what are their frequencies?
- In Jalview, we observed at column 68 a mixture of E(Glu), Q(Gln) and R(Arg). But what was the most likely state of this position in the last common ancestor of all sequences? What are the probabilities of A(AlA), E(Glu), Q(Gln) and R(Arg)?
- What are the evolutionary events (amino acid substitutions) at the basis of the Hominoidea clade (Hylobates lar, Gorilla, Pan Paniscus, Homo sapiens)?

Now it is time to extract ancestral sequences and put them in a file. The rst file is quite difficult to parse, hopefully, each ancestral sequence start by "node".

Download the following script and execute it:

./ rst

It displays ancestral sequences in FASTA format. Let's put them in a file:

./ rst > ancestral_sequences.fasta

Part 4: Compute physico-chemical properties on ancestral sequences.

In Biopython, there is a function to compute the isoelectric point (pI):

analysed_protein = ProtParam.ProteinAnalysis(sequence)
pI = analysed_protein.isoelectric_point()

The following script will compute the pI for all sequences in a FASTA file:

By launching it, we can retrieve the pI for modern primate lysozymes:

./ lysozyme_primates.fasta

And similarly for ancestral sequences:

./ ancestral_sequences.fasta

Part 5: Map properties on tree.

We could easily map ancestral properties on the tree. The tree provided in rst contains nodes where bootstrap information is.
We just need to change the values of these nodes by the corresponding pI.

Download the following script and execute it:

./ ancestral_sequences.fasta lysozyme_primates_annotated.tree >  lysozyme_primates_annotated_pI.tree

Have a look at both trees in a text editor.

You can load it in NJplot and see the different pI at the bootstrap place.

Alternatively, you can install FigTree:

and load the tree.

In FigTree, set the following parameters:
Appearance->Colour by:label
Setup: Colours  -> Scheme: Colour gradient
Tick gradient
Line Weight 4

Wednesday, 11 June 2014

Is there any example of study on amino acid sites under positive selection (dN/dS) that have been tested in vitro (confirmatory or not)?

(This post is more a open question than a direct answer).

Is there any example of study on amino acid sites under positive selection (dN/dS) that have been tested in vitro (confirmatory or not)?

By dN/dS, I mean any amino acid sites in any species that have been detected with CodeML, either the site models (M2a, M8), or the branch-site model. Or any other similar methods that aim to identify sites that could favour adaptation during evolutionary course.

By in vitro, I mean that these sites has been mutated in vitro and tested to see if they provide a different phenotype, or if the targeted protein exhibits different biochemical properties.

===>   I have the impression that there are plenty of studies that identifies such amino acids under positive selection in a large-scale manner, i.e.:

More genes underwent positive selection in chimpanzee evolution than in human evolution

Patterns of Positive Selection in Six Mammalian Genomes

Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes

Patterns of Positive Selection in Seven Ant Genomes

===> There are many studies that analyse these sites in silico by mapping them on the 3D structures, i.e.:

Patterns of Positive Selection in Six Mammalian Genomes

Adaptive Divergence of Ancient Gene Duplicates in the Avian MHC Class II β

Evolution of Genes Involved in Gamete Interaction: Evidence for Positive Selection, Duplications and Losses in Vertebrates

===> A few of them identifies the effect on the protein stability, i.e.:

Positively Selected Sites in Cetacean Myoglobins Contribute to Protein Stability

Stability-activity tradeoffs constrain the adaptive evolution of RubisCO

===> But I have difficulties to find studies that actually tested these sites in vitro. I found these ones:

Adaptive evolution of multicolored fluorescent proteins in reef-building corals.

Structural and Functional Evolution of Positively Selected Sites in Pine Glutathione S-Transferase Enzyme Family

But I am wondering if there are many other similar studies?

Thank you for your help!

Sunday, 1 June 2014

ECCB'14 - Tutorial - Protein Evolution Analysis: on the Use of Phylogenetic Trees

Hi all,

Just a small post to announce that me and two of my colleagues from EBI are doing a tutorial on the use of phylogenetics tree in evolutionary analysis (i.e. co-evolution between trees, codeml with biopython, ancestral sequence reconstruction). This will happen during ECCB 2014 in Strasbourg, France, on the Sunday September 7th.

All information available here:

See you there!

Evolution of RubisCO enzyme under structural constraints

(Reason of this post: I had to make a summary of my recent work for an application. So, here it is!)

Stability-activity trade-off constrains the adaptive evolution of RubisCO
Studer RA, Christin PA, Williams MA, Orengo CA.
Proc Natl Acad Sci U S A. 2014 Feb 11;111(6):2223-8. doi: 10.1073/pnas.1310811111.
Epub 2014 Jan 27. PMID:  24469821 Free PMC Article

Importance of the RubisCO enzyme
The ribulose-1,5-bisphosphate carboxylase/oxygenase [EC:] ]is the central enzyme in photosynthesis and one of the most abundant proteins in the world. The biological unit is a complex composed by eight large subunits and eight small subunits (Fig. 1A-B). Its catalytic activity is to fix the CO2 in the Calvin cycle and is performed by the large subunit (Fig. 1C). This reaction is extremely slow, with up to 3 molecules fixed per second. The most prevalent system in plants is the C3 photosynthesis pathway. In flowering plants (angiosperms), some lineages have evolved to a C4 photosynthesis pathway. In this pathway, the RubisCO is twice faster, with the ability to fix up to 6 molecules per second. A striking fact is that the emergence of this C4 photosynthesis pathway occurred in many divergent lineages, in a convergent manner. A major problem in the RubisCO is its dual affinity for CO2 and O2, which can lead to undesired photorespiration (fixation of O2), instead of photosynthesis (fixation of CO2). As this problem increases with the activity and can be costly to the plant, C4 plant lineages have also developed cellular mechanisms to concentrate the CO2 around the RubisCO, which prevent photorespiration.

Figure 1: Structural view of the RubisCO enzyme. A) Front view, with the large subunits in blue/yellow and the small subunits in purple; B) Top view, with the central solvent channel visible; C) Front view, with the two large subunits (in ribbons) forming the catalytic dimer; D) Sites detected under functional divergence.

Theoretical concepts in the evolution of proteins
Proteins are long chains of amino acids that tend to organise in the 3D structural space. Generally, proteins fold in a way to maximise the numbers of favourable atomic contacts, and to reduce the global free energy (ΔG, in kcal mol-1). However, enzymes need some degree of freedom, in order to move between different conformation, i.e. allowing the opening and closing of the catalytic pocket. Similarly, active sites are unfavourable in term of stability. This is why proteins are said to be marginally stable. The importance of the stability effect is seen when the replacement of an amino acid to another one occurs. Most amino acid replacements are very likely to disrupt the stability of the protein, and thus only a very few subset of amino acid changes is tolerated during evolution (Fig 2). However, it may happen that the change to another function, or the optimisation of a current function, can shift the stability towards its neutral area. This model is called stability-activity tradeoffs. These evolutionary events can be preceded by capacitive stabilising mutations and/or followed by compensating stabilising mutations.

Fig 2: The evolution of protein stability as a constrained “random walk” through sequence space. Protein sequences are represented as circles (yellow circles indicate sequences that are selectively neutral; red circles indicate those that have deleterious effects). Missense mutations are shown as the connecting labelled arrows. The series from 1 to 6 represents a trajectory of fixations through sequence space. The series of mutations from 1 to 3 represents a neutral “meandering” through sequence space. The adaptive fixation 4, which is advantageous despite its effects on stability and aggregation, induces a strong selection pressure for the compensating mutation 5 to restore stability to the neutral zone (reproduced from DePristo et al. 2005, Nat Rev Genet. 6(9):678-87).

Biological question
The RubisCO enzyme provides a perfect framework to study how enzymes evolved under structural constraints. In this study, I wanted to determine what are the residues responsible for the increase in catalytic activity, where are they located in the 3D structure and how they alter the stability of the complex.

Twelve amino acids are likely to be responsible for functional divergence
The phylogenetic-based algorithm TDG09 (Tamuri AU et al. 2009, PLoS Comput Biol 5(11): e1000564) aims to identify shift in selective pressures between groups of amino acids. Twelve sites have been identified (at 1% confidence) to be under strong selective pressure in RubisCO from C4 plant lineages compared to RubisCO from C4 plant lineages. None of these sites are part of the active site (which is, due to its importance, 100% conserved), but they are at different key positions in the 3D structure (Fig 1D), such as in the contact interface between subunit or in the opening/closing loop.

Reconstruction of ancestral sequences and structures
The intermediate sequences of RubisCO have been reconstructed under maximum likelihood, based on the phylogenetic trees of the 240 plant lineages. These sequences have been used to reconstruct the ancestral 3D structures by homology modelling. We have obtained a very high accuracy in each step, thanks to the extreme conservation of the RubisCO sequences (>90% of identity and no insertion/deletion, despite millions years of evolution) and the high quality of the crystal structure, which serves as template (1.35Å). We then were able to describe precisely all mutations that occurred at a particular time point, especially the contribution to the stability of these mutations. The stability effect of these substitutions (ΔΔG, in kcal mol-1) has been estimated by FoldX and the result has been mapped on the phylogenetic (Fig. 3).

Figure 3: Mapping of stability effect on the phylogenetics tree. This is an extract of the full phylogenetic tree, which has been built using the 240 sequences analysed in this study. C3 plants are in light green and C4 plants are in dark green. Blue slices indicate the percentage of stabilising mutations, while red slices indicate the Mutations L270I, A281S and A328S are frequently found in C4 and are destabilising. A328S is very close to the loop opening the catalytic pocket. A destabilising residue at this position could accelerate both the opening and closing of that loop. The mutation M309I is also frequently observed in C4, but has no effect in term of stability. Tree visualisation made with EvolView.

Stability-activity trade-off constrains the adaptive evolution of RubisCO
The comparison of the different C4 lineages led to the conclusion that the evolution of RubisCO to new environmental constraints (the C3->C4 transition) is constrained by stability-activity trade-offs (Fig. 4). Statistically speaking, there is a significant excess of destabilising mutations observed during the C3-C4 transition (p-value = 0.0080) and a significant excess of stabilising (compensatory) mutations observed right after the C3->C4 transition (p-value < 0.0001). While not statistically significant, we also observed an accumulation of slightly stabilizing mutations (which create the capacity to tolerate the functionally destabilizing mutations) by a long period before the C3->C4 transition

Figure 4: Frequency plot of mutations according to the positions relative to the transition C3->C4. Branches are annotated relative to the node where the transition C3->C4 occurred (position=0). Negative nodes are prior to the transition and positive nodes are after the transition. Interestingly, there is a peak a destabilising mutations (in orange) on the branches where the adaptation (transition C3->C4) occurred, followed by peak a stabilising mutations (in blue) on the posterior branches. This suggests that some mutations change the function (and destabilise the structure) and other mutations follow to compensate for this loss of stability.

Concluding remarks
These results demonstrated that the evolution of an enzyme, here the RubisCO, can be under strong structural constraints and that adaptive mutations are balanced between stabilising and destabilising effects. This shows that stability-activity trade-offs found in laboratory experiments (i.e. Bloom & Arnold 2009, PNAS 106:9995) have direct counterparts in the past 120 million years of plant evolution. A follow up of this project would be to extend this analytical framework to other enzymatic families where functional divergence is observed, and to apply this knowledge to directed-evolution experiment.

Thursday, 20 March 2014

Lampreys and Hagfishes are now reunited in a monophyletic clade (Cyclostomata) in the NCBI Taxonomy

The NCBI Taxonomy is a powerful resource and provides many tools to search for relationship between organism:

However, according to their disclaimer, they don't pretend to be an "authoritative source for nomenclature or classification". Many nodes are unresolved, and some nodes don't reflect recent changes. For exemple, the paraphyly of Hyperotreti (Myxines) and Hyperoartia (Sea lampreys): 

In this context, Hagfishes are not considered as proper Vertebrates, but only Craniata. However, they are all Vertebrates and Hagfishes and Lampreys should be clustered in a monophyletic clade called Cyclostomata ("round mouth").

We can suggest to the NCBI team to make appropriate changes. For example, the introduction of the Dipnotetrapodomorpha clade, which group Dipnoi and Tetrapoda and let the Coelacanth clade as an outgroup:

I wrote to NCBI the following email (see the following extracts), and they kindly replied and followed my suggestions:

I am writing to request a few changes in the NCBI Taxonomy, in order to reflect recent findings in the phylogeny of the basal Vertebrates.

The current NCBI taxonomy at the basis of Chordata (7711) assumes a paraphyletic relationship of Sea Lamprey (Hyperoartia, 117569) and Hagfishes (Hyperotreti, 117565):

However, recent publications strongly support the monophyly group of Hyperoartia and Hyperotreti, in a clade named “Cyclostomata”. This clade is a sister-clade of the jawed vertebrates “Gnathostomata”:
Nature. 2014 Feb 12. doi: 10.1038/nature12980.
A primitive placoderm sheds light on the origin of the jawed vertebrate face.
Dupret V1, Sanchez S2, Goujet D3, Tafforeau P4, Ahlberg PE1.
Development. 2012 Jun;139(12):2091-9. doi: 10.1242/dev.074716.
Evolutionary crossroads in developmental biology: cyclostomes (lamprey and hagfish).
Shimeld SM1, Donoghue PC.
Proc Biol Sci. 2011 Apr 22;278(1709):1150-7. doi: 10.1098/rspb.2010.1641.
Decay of vertebrate characters in hagfish and lamprey (Cyclostomata) and the implications for the vertebrate fossil record.
Sansom RS1, Gabbott SE, Purnell MA.
Proc Natl Acad Sci U S A. 2010 Nov 9;107(45):19137-8. doi: 10.1073/pnas.1014583107.
microRNAs revive old views about jawless vertebrate divergence and evolution.
Janvier P.
Proc Natl Acad Sci U S A. 2010 Nov 9;107(45):19379-83. doi: 10.1073/pnas.1010350107.
microRNAs reveal the interrelationships of hagfish, lampreys, and gnathostomes and the nature of the ancestral vertebrate.
Heimberg AM1, Cowper-Sal-lari R, Sémon M, Donoghue PC, Peterson KJ.
Mol Biol Evol. 2009 Jan;26(1):47-59. doi: 10.1093/molbev/msn222.
Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after?
Kuraku S, Meyer A, Kuratani S.

I propose the following changes in the NCBI taxonomy:

- The “Craniata” clade (89593) should be replaced by the “Vertebrata” clade (7742). Craniata could eventually be used as a synonym of Vertebrata. 
- A “Cyclostomata” clade should be created within the “Vertebrata” clade (7742). This “Cyclostomata” clade would be a sister group of the “Gnathostomata” clade (7776).

- The “Hyperotreti” clade (117565) should be moved into the new “Cyclostomata” clade.

- Similarly, the “Hyperoartia” clade (117569) should be moved into the new “Cyclostomata” clade.

The new taxonomy would look like:

- Vertebrata (vertebrates)
--- Cyclostomata (living agnathans)
------ Hyperotreti (fish)
------------ Myxiniformes
------ Hyperoartia (fish)
------------ Petromyzontiformes
--- Gnathostomata (jawed vertebrates)
------ Chondrichthyes (cartilaginous fishes)
------ Teleostomi

We are using the NCBI taxonomy to automatize the clustering of genes based on the tree of life. It would be very useful for us if the NCBI taxonomy can reflect the recent supports of the monophyly of Hyperoartia and Hyperotreti in the Cyclostomata clade.

And here is the result a few days later:

Nice, isn't it ?

And the information is also transferered in the dump files:

The old one:
grep "Cyclostomata" names.dmp
97265    |    Cyclostomata                        |                |    synonym        |
97265    |    Cyclostomata Busk , 1852    |                |    authority        |
The new one:
grep "Cyclostomata" names.dmp
97265    |    Cyclostomata                       |    Cyclostomata <bryozoan>    |    synonym        |
97265    |    Cyclostomata Busk , 1852    |                |    authority    |
1476529    |    Cyclostomata              |    Cyclostomata <chordate>    |    scientific name    |

If all systematicians and other biologists around the world join their efforts and contribute to this resource, it will become a very accurate and powerful tool to assist molecular evolution analyses.