Thursday, 25 September 2014

Tutorial on Ancestral Sequence Reconstruction

This tutorial was part of a course on protein evolution given during ECCB 2014 in Strasbourg: http://www.eccb14.org/program/tutorials/pea


NB: If you have any questions or comments, or if you find bugs, feel free to ask. ;)



Introduction

The slides of the introduction are available here:
ancestral_sequence_reconstruction.pdf


In this practical, you will learn how to prepare files for CodeML, how to use it for reconstructing ancestral sequences and how to compute the isoelectric point of protein sequences. There are a few scripts you will need to download during the practical.

If you need to make these scripts executable, use chmod:
chmod +x script.py


Tools used in this practical:

# Libraries for Python
Biopython: http://biopython.org/wiki/Main_Page

# Package for ancestral sequence reconstruction (and many other things)
CodeML from PAML: http://abacus.gene.ucl.ac.uk/software/paml.html

# Alignment tools
MAFFT L-INS-i: http://mafft.cbrc.jp/alignment/software/
Clustal-Omega: http://www.clustal.org/omega/
(Clustal-Omega is the new aligner from the ClustalW team; it is much faster and more accurate than ClustalW)

# Alignment visualisation
Jalview: http://www.jalview.org/

# Phylogenetics tools
PhyML: http://www.atgc-montpellier.fr/phyml/binaries.php
FastTree: http://www.microbesonline.org/fasttree/

# Tree visualisation
NJplot: http://doua.prabi.fr/software/njplot



This practical will focus on lysozyme, an enzyme (EC 3.2.1.17) that damages bacterial cell walls. The Uniprot page is: http://www.uniprot.org/uniprot/P61626
Lysozymes have evolved differently in Primates:

















 

 

 

 

 

 

Step 1: Prepare alignment.

The quality of the ancestral reconstruction will heavily depend on the quality of the alignment and the tree topology (branch lengths are re-estimated during the reconstruction).

Please download the sequence file: lysozyme_primates.seq

Make a multiple alignment, either with Mafft-L-INS-i or Clustal-Omega:

mafft-linsi lysozyme_primates.seq > lysozyme_primates.fasta
or
clustal-omega-1.2.0-macosx --in lysozyme_primates.seq --out lysozyme_primates.fasta

(Please have a look at the alignment in Jalview)


The format of the resulting alignment is FASTA. However, most phylogenetic software uses the PHYLIP format, so you have to convert the alignment into PHYLIP.
Download the script "convert_fasta2phylip.py" and execute it:

./convert_fasta2phylip.py lysozyme_primates.fasta lysozyme_primates.phy

(If you are not familiar with them, have a look at the differences between the alignment in FASTA and PHYLIP formats).
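
To make the difference between the two formats concrete, here is a minimal sketch of such a conversion in plain Python. It is not the convert_fasta2phylip.py script from the practical (which is not shown here); the function names `read_fasta` and `to_phylip` are made up for illustration. It assumes a sequential, "relaxed" PHYLIP layout: a header line with the number of sequences and the alignment length, then one line per sequence with the name, two spaces and the aligned residues.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Render aligned records as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    width = max(len(name) for name, _ in records)
    # Header line: number of sequences, then number of alignment columns.
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Two spaces after the padded name separate it from the sequence.
        lines.append(f"{name.ljust(width)}  {seq}")
    return "\n".join(lines) + "\n"
```

For example, `to_phylip(read_fasta(open("lysozyme_primates.fasta").read()))` would return the PHYLIP text for the alignment produced above. Strict PHYLIP limits names to 10 characters, so long names may need shortening for some programs.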


Step 2: Build the tree.

We can now generate a tree, either with PhyML (one of the most accurate tools) or FastTree (very fast and pretty accurate):

phyml -i lysozyme_primates.phy -d aa -m JTT -c 4 -a e -b 0
mv lysozyme_primates.phy_phyml_tree.txt lysozyme_primates.tree

Options used:
-i: input file
-d aa: (amino acid sequences)
-m JTT: (substitution matrix). JTT works fine for most proteins, but other matrices (WAG, LG) can do slightly better.
-c 4: (number of categories for the gamma distribution)
-a e: (estimate the alpha parameter of the gamma distribution)
-b 0: (we don't want bootstrap values, as they will cause trouble for further analyses in CodeML).


or run:

FastTree -nosupport lysozyme_primates.phy > lysozyme_primates.tree

Option used:
-nosupport: (we don't want bootstrap values, as they will cause trouble for further analyses in CodeML).


Finally, we can root the tree. Use NJplot and root it by the group containing the Marmoset sequence (Callithrix jacchus).

Save it as "lysozyme_primates_rooted.tree"


Step 3: Run ancestral sequence reconstruction.


The ancestral sequence reconstruction is done by CodeML, from the PAML package.

It is launched with "codeml control_file.ctl"

You may have to copy the file "jones.dat" from the dat folder in the PAML package, or indicate its location.



The control file contains many parameters:

      seqfile = lysozyme_primates.phy    * sequence data filename
     treefile = lysozyme_primates_rooted.tree   * tree structure file name
      outfile = lysozyme_primates.mlc    * main result file name

        noisy = 9  * 0,1,2,3,9: how much rubbish on the screen
      verbose = 2  * 0: concise; 1: detailed, 2: too much
      runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

      seqtype = 2  * 1:codons; 2:AAs; 3:codons-->AAs

        clock = 0  * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
       aaDist = 0  * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
   aaRatefile = ./jones.dat  * only used for aa seqs with model=empirical(_F)

                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own

        model = 2
                   * models for codons:
                       * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                       * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                       * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)

        icode = 0  * 0:universal code; 1:mammalian mt; 2-10:see below
        Mgene = 0     * codon: 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff            
                   * AA: 0:rates, 1:separate

    fix_alpha = 0   * 0: estimate gamma shape parameter; 1: fix it at alpha
        alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 1   * different alphas for genes
        ncatG = 4   * # of categories in dG of NSsites models


        getSE = 0  * 0: don't want them, 1: want S.E.s of estimates

 RateAncestor = 1  * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)

   Small_Diff = .5e-6
    cleandata = 0  * remove sites with ambiguity data (1:yes, 0:no)?
       method = 1  * Optimization method 0: simultaneous; 1: one branch a time


* Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt.,
* 4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt.,
* 7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt.,
* 10: blepharisma nu.
* These codes correspond to transl_table 1 to 11 of GENEBANK.


Explanation of some parameters:
runmode = 0 => We provide the tree.
clock = 0 => We don't set a molecular clock. We assume that the sequences evolve at different rates in different lineages.
aaDist = 0 => We don't use the physicochemical properties of the amino acids.
aaRatefile = ./jones.dat => We use the JTT matrix. Other matrices could be used (WAG, etc.).
model = 2 => We use an empirical model (= substitution matrix such as JTT).
fix_alpha = 0 => We estimate the alpha parameter of the gamma distribution.
alpha = 0.5 => We start the estimation from 0.5.
RateAncestor = 1 => Force the estimation of ancestral states.
cleandata = 0 => Keep all ambiguous data ("-", "X").
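
If you want to run the reconstruction on several alignments, the control file can be written by a small script instead of by hand. The sketch below is only an illustration: `render_ctl` and `write_ctl` are hypothetical helper names, "lysozyme_primates.ctl" is a made-up file name, and the dictionary only lists the settings discussed above, so the remaining lines of the full control file would have to be added for a real run.

```python
def render_ctl(options):
    """Render a dict of CodeML settings as right-aligned 'key = value' lines."""
    width = max(len(key) for key in options)
    lines = [f"{key.rjust(width)} = {value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


def write_ctl(path, options):
    """Write the rendered settings to a control file."""
    with open(path, "w") as handle:
        handle.write(render_ctl(options))


# Settings explained in the text; file names follow the earlier steps.
asr_options = {
    "seqfile": "lysozyme_primates.phy",
    "treefile": "lysozyme_primates_rooted.tree",
    "outfile": "lysozyme_primates.mlc",
    "runmode": 0,
    "seqtype": 2,
    "clock": 0,
    "aaDist": 0,
    "aaRatefile": "./jones.dat",
    "model": 2,
    "fix_alpha": 0,
    "alpha": 0.5,
    "ncatG": 4,
    "RateAncestor": 1,
    "cleandata": 0,
}
```

Calling `write_ctl("lysozyme_primates.ctl", asr_options)` would then produce a file that can be passed to codeml in the usual way.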



Please have a look at the output.

CodeML will also write into many files, but only two are of interest here:

  • lysozyme_primates.mlc => Contains a lot of information on evolutionary rates.
  • rst => Contains ancestral states for sites and for nodes.
Please have a look at both files.


The rst file also shows how the tree has been annotated, under the line "tree with node labels for Rod Page's TreeView".
You can copy this tree into a file (e.g. "lysozyme_primates_annotated.tree"), open it with NJplot, and display the "bootstrap values".
We can then see which ancestral sequence corresponds to which node.


Questions:
- What is the estimated alpha parameter of the gamma curve?
- How many categories were used? And what are their frequencies?
- In Jalview, we observed at column 68 a mixture of E (Glu), Q (Gln) and R (Arg). But what was the most likely state of this position in the last common ancestor of all sequences? What are the probabilities of A (Ala), E (Glu), Q (Gln) and R (Arg)?
- What are the evolutionary events (amino acid substitutions) at the base of the Hominoidea clade (Hylobates lar, Gorilla, Pan paniscus, Homo sapiens)?


Now it is time to extract the ancestral sequences and put them in a file. The rst file is quite difficult to parse, but fortunately each ancestral sequence starts with "node".

Download the following script and execute it: parse_rst.py

./parse_rst.py rst

It displays ancestral sequences in FASTA format. Let's put them in a file:

./parse_rst.py rst > ancestral_sequences.fasta
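
For the curious, the idea behind such a parser can be sketched in a few lines. This is not the parse_rst.py script itself; `extract_ancestors` and `to_fasta` are hypothetical names. It assumes that each reconstructed sequence sits on a single line of the form "node #20   KVFERCELAR TLKRLGLDGY ...", with residues printed in space-separated blocks; if the layout of your rst file differs, the regular expression needs adapting.

```python
import re

# A line such as "node #20          KVFERCELAR TLKRLGLDGY"
NODE_LINE = re.compile(r"^node\s*#(\d+)\s+(.*)$")


def extract_ancestors(rst_text):
    """Return {"node_<number>": sequence} for every 'node #' line."""
    ancestors = {}
    for line in rst_text.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            node, residues = match.groups()
            # Residues are printed in blocks; remove the spaces between them.
            ancestors[f"node_{node}"] = residues.replace(" ", "")
    return ancestors


def to_fasta(sequences):
    """Format a {name: sequence} dict as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())
```

With the rst file read into a string, `print(to_fasta(extract_ancestors(text)), end="")` gives one FASTA record per internal node. Extant sequences are skipped because their lines do not start with "node".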



Step 4: Compute physico-chemical properties on ancestral sequences.


In Biopython, there is a function to compute the isoelectric point (pI):

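from Bio.SeqUtils import ProtParam
# 'sequence' is the protein sequence as a plain string of one-letter codes
analysed_protein = ProtParam.ProteinAnalysis(sequence)
pI = analysed_protein.isoelectric_point()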
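
To get a feeling for what such a function does, the pI can be approximated in plain Python by searching for the pH at which the net charge of the protein is zero. The sketch below is not Biopython's implementation: the pK values are rounded textbook-style numbers chosen for illustration, and the names `net_charge` and `isoelectric_point` are hypothetical, so the results will differ slightly from those of ProtParam.

```python
# Approximate pK values of the charged groups (illustrative assumption).
POSITIVE_PK = {"Nterm": 9.0, "K": 10.0, "R": 12.0, "H": 6.0}
NEGATIVE_PK = {"Cterm": 2.0, "D": 4.0, "E": 4.5, "C": 9.0, "Y": 10.0}


def net_charge(sequence, ph):
    """Net charge of a protein at a given pH (Henderson-Hasselbalch)."""
    counts = {"Nterm": 1, "Cterm": 1}
    for residue in sequence.upper():
        counts[residue] = counts.get(residue, 0) + 1
    positive = sum(
        counts.get(group, 0) / (1.0 + 10 ** (ph - pk))
        for group, pk in POSITIVE_PK.items()
    )
    negative = sum(
        counts.get(group, 0) / (1.0 + 10 ** (pk - ph))
        for group, pk in NEGATIVE_PK.items()
    )
    return positive - negative


def isoelectric_point(sequence, tolerance=1e-4):
    """Find the pH where the net charge is zero, by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2.0
        # The net charge decreases with pH: positive means the pI is higher.
        if net_charge(sequence, middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0
```

A sequence rich in lysine and arginine gets a high pI with this function and one rich in aspartate and glutamate a low pI, which is the trend we will follow along the tree.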


The following script will compute the pI for all sequences in a FASTA file: compute_pI.py

By launching it, we can retrieve the pI for modern primate lysozymes:

./compute_pI.py lysozyme_primates.fasta

And similarly for ancestral sequences:

./compute_pI.py ancestral_sequences.fasta



Step 5: Map properties on tree.

We can easily map ancestral properties on the tree. In the tree provided in rst, the node labels are stored where bootstrap values would normally be.
We just need to replace these node labels with the corresponding pI.

Download the following script and execute it: map_on_tree.py

./map_on_tree.py ancestral_sequences.fasta lysozyme_primates_annotated.tree >  lysozyme_primates_annotated_pI.tree
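
The replacement itself is a simple text substitution on the Newick string. Here is a minimal sketch of that step, not the map_on_tree.py script itself: `relabel_internal_nodes` is a hypothetical name, and it assumes that internal nodes are labelled with an integer right after a closing parenthesis, as in ") 21 : 0.03".

```python
import re


def relabel_internal_nodes(newick, values, digits=2):
    """Replace integer labels of internal nodes with values such as pI.

    'values' maps node numbers (as strings) to floats; labels without
    a value are left unchanged.
    """
    def substitute(match):
        node = match.group(1)
        if node not in values:
            return match.group(0)
        return f"){values[node]:.{digits}f}"

    # An internal label is an integer that directly follows ")".
    return re.sub(r"\)\s*(\d+)(?![\d.])", substitute, newick)
```

For example, with the toy tree "((1_Hsa: 0.01, 2_Ptr: 0.02) 5 : 0.03, 3_Cja: 0.10) 4 ;" and the values {"5": 9.284, "4": 8.9}, the function returns "((1_Hsa: 0.01, 2_Ptr: 0.02)9.28 : 0.03, 3_Cja: 0.10)8.90 ;". The species names and numbers in this example are invented.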

Have a look at both trees in a text editor.

You can load it in NJplot and see the pI values displayed in place of the bootstrap values.


Alternatively, you can install FigTree: http://tree.bio.ed.ac.uk/software/figtree/

and load the tree.

In FigTree, set the following parameters:
Appearance->Colour by:label
Setup: Colours  -> Scheme: Colour gradient
Tick gradient
Line Weight 4

Wednesday, 11 June 2014

Is there any example of study on amino acid sites under positive selection (dN/dS) that have been tested in vitro (confirmatory or not)?

(This post is more an open question than a direct answer).


Is there any example of study on amino acid sites under positive selection (dN/dS) that have been tested in vitro (confirmatory or not)?

By dN/dS, I mean any amino acid sites in any species that have been detected with CodeML, using either the site models (M2a, M8) or the branch-site model, or with any similar method that aims to identify sites that could have favoured adaptation during the course of evolution.

By in vitro, I mean that these sites have been mutated in vitro and tested to see if they produce a different phenotype, or if the targeted protein exhibits different biochemical properties.


===>   I have the impression that there are plenty of studies that identify such amino acids under positive selection in a large-scale manner, e.g.:

More genes underwent positive selection in chimpanzee evolution than in human evolution
http://www.pnas.org/content/104/18/7489.full

Patterns of Positive Selection in Six Mammalian Genomes
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000144

Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes
http://genome.cshlp.org/content/18/9/1393.short

Patterns of Positive Selection in Seven Ant Genomes
http://mbe.oxfordjournals.org/content/early/2014/05/06/molbev.msu141.abstract



===> There are many studies that analyse these sites in silico by mapping them on the 3D structures, e.g.:

Patterns of Positive Selection in Six Mammalian Genomes
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000144

Adaptive Divergence of Ancient Gene Duplicates in the Avian MHC Class II β
http://mbe.oxfordjournals.org/content/27/10/2360

Evolution of Genes Involved in Gamete Interaction: Evidence for Positive Selection, Duplications and Losses in Vertebrates
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0044548


===> A few of them identify the effect on protein stability, e.g.:

Positively Selected Sites in Cetacean Myoglobins Contribute to Protein Stability
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002929

Stability-activity tradeoffs constrain the adaptive evolution of RubisCO
http://www.pnas.org/content/111/6/2223.long


===> But I have difficulty finding studies that actually tested these sites in vitro. I found these ones:

Adaptive evolution of multicolored fluorescent proteins in reef-building corals.
http://link.springer.com/article/10.1007%2Fs00239-005-0129-9

Structural and Functional Evolution of Positively Selected Sites in Pine Glutathione S-Transferase Enzyme Family
http://www.jbc.org/content/288/34/24441.full


But I am wondering whether there are many other similar studies.


Thank you for your help!

Sunday, 1 June 2014

ECCB'14 - Tutorial - Protein Evolution Analysis: on the Use of Phylogenetic Trees

Hi all,

Just a small post to announce that two of my colleagues from EBI and I are giving a tutorial on the use of phylogenetic trees in evolutionary analysis (e.g. co-evolution between trees, codeml with Biopython, ancestral sequence reconstruction). This will happen during ECCB 2014 in Strasbourg, France, on Sunday, September 7th.

All information available here:




See you there!


Evolution of RubisCO enzyme under structural constraints

(Reason for this post: I had to write a summary of my recent work for an application. So, here it is!)



Stability-activity tradeoffs constrain the adaptive evolution of RubisCO
Studer RA, Christin PA, Williams MA, Orengo CA.
Proc Natl Acad Sci U S A. 2014 Feb 11;111(6):2223-8. doi: 10.1073/pnas.1310811111.
Epub 2014 Jan 27. PMID:  24469821 Free PMC Article


Importance of the RubisCO enzyme
The ribulose-1,5-bisphosphate carboxylase/oxygenase [EC: 4.1.1.39] is the central enzyme in photosynthesis and one of the most abundant proteins in the world. The biological unit is a complex composed of eight large subunits and eight small subunits (Fig. 1A-B). Its catalytic activity, which is to fix CO2 in the Calvin cycle, is performed by the large subunit (Fig. 1C). This reaction is extremely slow, with up to 3 molecules fixed per second. The most prevalent system in plants is the C3 photosynthesis pathway. In flowering plants (angiosperms), some lineages have evolved a C4 photosynthesis pathway. In this pathway, RubisCO is twice as fast, with the ability to fix up to 6 molecules per second. A striking fact is that this C4 photosynthesis pathway emerged convergently in many divergent lineages. A major problem with RubisCO is its dual affinity for CO2 and O2, which can lead to undesired photorespiration (fixation of O2) instead of photosynthesis (fixation of CO2). As this problem increases with the activity and can be costly to the plant, C4 plant lineages have also developed cellular mechanisms to concentrate CO2 around RubisCO, which prevents photorespiration.


Figure 1: Structural view of the RubisCO enzyme. A) Front view, with the large subunits in blue/yellow and the small subunits in purple; B) Top view, with the central solvent channel visible; C) Front view, with the two large subunits (in ribbons) forming the catalytic dimer; D) Sites detected under functional divergence.

Theoretical concepts in the evolution of proteins
Proteins are long chains of amino acids that tend to organise in the 3D structural space. Generally, proteins fold in a way that maximises the number of favourable atomic contacts and reduces the global free energy (ΔG, in kcal mol-1). However, enzymes need some degree of freedom in order to move between different conformations, e.g. to allow the opening and closing of the catalytic pocket. Similarly, active sites are unfavourable in terms of stability. This is why proteins are said to be marginally stable. The importance of the stability effect is seen when one amino acid is replaced by another. Most amino acid replacements are very likely to disrupt the stability of the protein, and thus only a very small subset of amino acid changes is tolerated during evolution (Fig 2). However, it may happen that the change to another function, or the optimisation of a current function, shifts the stability away from its neutral zone. This model is called stability-activity tradeoffs. These evolutionary events can be preceded by capacitive stabilising mutations and/or followed by compensating stabilising mutations.

Fig 2: The evolution of protein stability as a constrained “random walk” through sequence space. Protein sequences are represented as circles (yellow circles indicate sequences that are selectively neutral; red circles indicate those that have deleterious effects). Missense mutations are shown as the connecting labelled arrows. The series from 1 to 6 represents a trajectory of fixations through sequence space. The series of mutations from 1 to 3 represents a neutral “meandering” through sequence space. The adaptive fixation 4, which is advantageous despite its effects on stability and aggregation, induces a strong selection pressure for the compensating mutation 5 to restore stability to the neutral zone (reproduced from DePristo et al. 2005, Nat Rev Genet. 6(9):678-87).


Biological question
The RubisCO enzyme provides a perfect framework to study how enzymes evolve under structural constraints. In this study, I wanted to determine which residues are responsible for the increase in catalytic activity, where they are located in the 3D structure and how they alter the stability of the complex.

Twelve amino acids are likely to be responsible for functional divergence
The phylogeny-based algorithm TDG09 (Tamuri AU et al. 2009, PLoS Comput Biol 5(11): e1000564) aims to identify shifts in selective pressures between groups of amino acids. Twelve sites have been identified (at 1% confidence) to be under strong selective pressure in RubisCO from C4 plant lineages compared to RubisCO from C3 plant lineages. None of these sites are part of the active site (which is, due to its importance, 100% conserved), but they are at different key positions in the 3D structure (Fig 1D), such as in the contact interface between subunits or in the opening/closing loop.

Reconstruction of ancestral sequences and structures
The intermediate sequences of RubisCO have been reconstructed under maximum likelihood, based on the phylogenetic tree of the 240 plant lineages. These sequences have been used to reconstruct the ancestral 3D structures by homology modelling. We have obtained a very high accuracy in each step, thanks to the extreme conservation of the RubisCO sequences (>90% identity and no insertions/deletions, despite millions of years of evolution) and the high quality of the crystal structure that serves as template (1.35Å). We were then able to describe precisely all mutations that occurred at a particular time point, especially the contribution of these mutations to the stability. The stability effect of these substitutions (ΔΔG, in kcal mol-1) has been estimated by FoldX and the result has been mapped on the phylogenetic tree (Fig. 3).

Figure 3: Mapping of stability effect on the phylogenetic tree. This is an extract of the full phylogenetic tree, which has been built using the 240 sequences analysed in this study. C3 plants are in light green and C4 plants are in dark green. Blue slices indicate the percentage of stabilising mutations, while red slices indicate the percentage of destabilising mutations. Mutations L270I, A281S and A328S are frequently found in C4 and are destabilising. A328S is very close to the loop opening the catalytic pocket. A destabilising residue at this position could accelerate both the opening and closing of that loop. The mutation M309I is also frequently observed in C4, but has no effect in terms of stability. Tree visualisation made with EvolView.


Stability-activity trade-off constrains the adaptive evolution of RubisCO
The comparison of the different C4 lineages led to the conclusion that the adaptation of RubisCO to new environmental constraints (the C3->C4 transition) is constrained by stability-activity trade-offs (Fig. 4). Statistically speaking, there is a significant excess of destabilising mutations observed during the C3->C4 transition (p-value = 0.0080) and a significant excess of stabilising (compensatory) mutations observed right after the C3->C4 transition (p-value < 0.0001). While not statistically significant, we also observed an accumulation of slightly stabilising mutations (which create the capacity to tolerate the functionally destabilising mutations) over a long period before the C3->C4 transition.

Figure 4: Frequency plot of mutations according to their position relative to the C3->C4 transition. Branches are annotated relative to the node where the C3->C4 transition occurred (position=0). Negative nodes are prior to the transition and positive nodes are after the transition. Interestingly, there is a peak of destabilising mutations (in orange) on the branches where the adaptation (C3->C4 transition) occurred, followed by a peak of stabilising mutations (in blue) on the posterior branches. This suggests that some mutations change the function (and destabilise the structure) and other mutations follow to compensate for this loss of stability.


Concluding remarks
These results demonstrate that the evolution of an enzyme, here RubisCO, can be under strong structural constraints and that adaptive mutations are balanced between stabilising and destabilising effects. This shows that stability-activity trade-offs found in laboratory experiments (e.g. Bloom & Arnold 2009, PNAS 106:9995) have direct counterparts in the past 120 million years of plant evolution. A follow-up of this project would be to extend this analytical framework to other enzymatic families where functional divergence is observed, and to apply this knowledge to directed-evolution experiments.

Thursday, 20 March 2014

Lampreys and Hagfishes are now reunited in a monophyletic clade (Cyclostomata) in the NCBI Taxonomy

The NCBI Taxonomy is a powerful resource and provides many tools to search for relationships between organisms:
http://www.ncbi.nlm.nih.gov/taxonomy

However, according to their disclaimer, they don't claim to be an "authoritative source for nomenclature or classification". Many nodes are unresolved, and some nodes don't reflect recent changes. For example, the paraphyly of Hyperotreti (Myxines) and Hyperoartia (Sea lampreys):


In this context, Hagfishes are not considered as proper Vertebrates, but only as Craniata. However, they are all Vertebrates, and Hagfishes and Lampreys should be clustered in a monophyletic clade called Cyclostomata ("round mouth").

We can suggest appropriate changes to the NCBI team. One example is the introduction of the Dipnotetrapodomorpha clade, which groups Dipnoi and Tetrapoda and leaves the Coelacanth clade as an outgroup:
http://people.unil.ch/marcrobinsonrechavi/2013/05/dont-complain-about-ncbi-taxonomy-improve-it/
http://bgeedb.wordpress.com/2013/05/29/new-taxon-dipnotetrapodomorpha-in-ncbi-taxonomy/



I wrote to NCBI the following email (see the following extracts), and they kindly replied and followed my suggestions:

I am writing to request a few changes in the NCBI Taxonomy, in order to reflect recent findings in the phylogeny of the basal Vertebrates.


The current NCBI taxonomy at the basis of Chordata (7711) assumes a paraphyletic relationship of Sea Lamprey (Hyperoartia, 117569) and Hagfishes (Hyperotreti, 117565):


However, recent publications strongly support the monophyly of Hyperoartia and Hyperotreti, in a clade named “Cyclostomata”. This clade is a sister-clade of the jawed vertebrates “Gnathostomata”:

http://www.ncbi.nlm.nih.gov/pubmed/24522530
Nature. 2014 Feb 12. doi: 10.1038/nature12980.
A primitive placoderm sheds light on the origin of the jawed vertebrate face.
Dupret V, Sanchez S, Goujet D, Tafforeau P, Ahlberg PE.

http://www.ncbi.nlm.nih.gov/pubmed/22619386
Development. 2012 Jun;139(12):2091-9. doi: 10.1242/dev.074716.
Evolutionary crossroads in developmental biology: cyclostomes (lamprey and hagfish).
Shimeld SM, Donoghue PC.

http://www.ncbi.nlm.nih.gov/pubmed/20947532
Proc Biol Sci. 2011 Apr 22;278(1709):1150-7. doi: 10.1098/rspb.2010.1641.
Decay of vertebrate characters in hagfish and lamprey (Cyclostomata) and the implications for the vertebrate fossil record.
Sansom RS, Gabbott SE, Purnell MA.

http://www.ncbi.nlm.nih.gov/pubmed/21041649
Proc Natl Acad Sci U S A. 2010 Nov 9;107(45):19137-8. doi: 10.1073/pnas.1014583107.
microRNAs revive old views about jawless vertebrate divergence and evolution.
Janvier P.

http://www.ncbi.nlm.nih.gov/pubmed/20959416
Proc Natl Acad Sci U S A. 2010 Nov 9;107(45):19379-83. doi: 10.1073/pnas.1010350107.
microRNAs reveal the interrelationships of hagfish, lampreys, and gnathostomes and the nature of the ancestral vertebrate.
Heimberg AM, Cowper-Sal-lari R, Sémon M, Donoghue PC, Peterson KJ.

http://www.ncbi.nlm.nih.gov/pubmed/18842688
Mol Biol Evol. 2009 Jan;26(1):47-59. doi: 10.1093/molbev/msn222.
Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after?
Kuraku S, Meyer A, Kuratani S.

I propose the following changes in the NCBI taxonomy:

- The “Craniata” clade (89593) should be replaced by the “Vertebrata” clade (7742). Craniata could eventually be used as a synonym of Vertebrata. 
- A “Cyclostomata” clade should be created within the “Vertebrata” clade (7742). This “Cyclostomata” clade would be a sister group of the “Gnathostomata” clade (7776).

- The “Hyperotreti” clade (117565) should be moved into the new “Cyclostomata” clade.

- Similarly, the “Hyperoartia” clade (117569) should be moved into the new “Cyclostomata” clade.

The new taxonomy would look like:

- Vertebrata (vertebrates)
--- Cyclostomata (living agnathans)
------ Hyperotreti (fish)
------------ Myxiniformes
------ Hyperoartia (fish)
------------ Petromyzontiformes
--- Gnathostomata (jawed vertebrates)
------ Chondrichthyes (cartilaginous fishes)
------ Teleostomi


We are using the NCBI taxonomy to automate the clustering of genes based on the tree of life. It would be very useful for us if the NCBI taxonomy could reflect the recent support for the monophyly of Hyperoartia and Hyperotreti in the Cyclostomata clade.



And here is the result a few days later:
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=89593&lvl=3&p=mapview&p=has_linkout&p=blast_url&p=genome_blast&srchmode=1&keep=1&unlock





Nice, isn't it?

And the information has also been transferred to the dump files:

The old one:
grep "Cyclostomata" names.dmp
97265    |    Cyclostomata                        |                |    synonym        |
97265    |    Cyclostomata Busk , 1852    |                |    authority        |
The new one:
grep "Cyclostomata" names.dmp
97265    |    Cyclostomata                       |    Cyclostomata <bryozoan>    |    synonym        |
97265    |    Cyclostomata Busk , 1852    |                |    authority    |
1476529    |    Cyclostomata              |    Cyclostomata <chordate>    |    scientific name    |
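
The same lookup can be done from a script, which is handy when the taxonomy is used in a pipeline. The sketch below is only an illustration; `find_taxa` is a hypothetical name. It assumes the layout visible in the grep output above, i.e. four fields separated by "|": taxon identifier, name, unique name and name class.

```python
def find_taxa(lines, name):
    """Return (tax_id, unique_name, name_class) for exact matches of 'name'."""
    hits = []
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        tax_id, name_txt, unique_name, name_class = fields[:4]
        if name_txt == name:
            hits.append((int(tax_id), unique_name, name_class))
    return hits
```

With the new dump, `find_taxa(open("names.dmp"), "Cyclostomata")` should return both the bryozoan synonym (97265) and the new chordate scientific name (1476529). Unlike grep, the exact match skips the authority line "Cyclostomata Busk , 1852".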



If all systematists and other biologists around the world join their efforts and contribute to this resource, it will become a very accurate and powerful tool to assist molecular evolution analyses.









Friday, 20 December 2013

SMBE 2014 - Symposia with a particular focus on protein evolution

The list of Symposia for the SMBE (Society for Molecular Biology & Evolution) conference has been published:
 


I selected some that have a particular focus on protein evolution (so a direct link with this blog):


2. Biochemistry meets molecular evolution
Molecular evolution research is often focused on sequence analysis, treating genes and genomes as simple strings composed of letters A, C, G, and T. Yet these sequences represent real, three-dimensional molecules with complex structure and function. Many of the most fundamental breakthroughs in our understanding of molecular evolution have come from extracting core findings from biochemistry and molecular biology and incorporating them into evolutionary models and techniques. For example, the structure of the genetic code gave us dN/dS tests. In more recent years, knowledge about RNA secondary structures, about nucleosome positioning signals, and about protein folding free energies have all contributed to our understanding of molecular evolution.
This symposium will bring together researchers working at the interface of biochemistry and molecular evolution, and contribute to an extended evolutionary synthesis that brings biochemistry and molecular biology into the core of evolutionary thought. The invited speakers are biochemists who would not normally attend SMBE, whose talks will complement submitted talks from the evolutionary biology community. Our hope is that the symposium will be a showcase for the emerging synthesis, highlighting the diversity of biochemical facts that are relevant to molecular evolution, and encouraging more molecular evolutionists to incorporate biochemical thinking into their work.
Organizers: Joanna Masel, Claus Wilke


7. Everything That Rises Must Converge
Although evolution mostly proceeds by accumulation of differences between groups, convergent evolution can occur when similar solutions are found to common evolutionary problems. Well known morphological examples include eyes and wings, but examples are increasingly being found at the molecular level, including proteins involved in echolocation in bats and cetaceans, foregut fermentation proteins in monkeys and cows, transcription factors in mammals and birds, and mitochondrial proteins in snakes and agamid lizards. These examples suggest that convergent molecular evolution may be more common than previously thought. Undetected convergence can provide strong support for incorrect phylogenies, while evidence for convergence can inform on the adaptive landscape, the constraints acting on evolutionary processes, and the role of chance and necessity in evolution.
Our symposium will discuss specific examples and provide theoretical insights addressing: 1) How can adaptive convergence be identified and distinguished from other evolutionary processes? 2) When is convergence a problem for phylogenetic analysis? 3) What information does convergence provide about selective pressures and the process of adaptation? The diverse ramifications of convergent evolution will be interesting to those who want to understand basic principles of evolution as well as those who want more accurate phylogenetic trees.
Organizers: David Pollock, Richard Goldstein


8. Evolution of Protein Superfamilies: Origins, Structure and Function (Combined)
Reconstructing and interpreting the phylogenetic history of protein superfamilies (i.e., families that include paralogous genes) pose unique challenges. Phylogenetic accuracy is particularly difficult to achieve in the face of the extreme functional and structural variability observed in protein superfamilies spanning many paralogous clades. Post-hoc interpretation of phylogenies offers valuable insights into protein structure and function, but best practices have yet to be established for tasks such as predicting gene origins relative to key transitions in species evolution and integrating this knowledge with other attributes of genes, phylogenomic ortholog identification and elucidating the co-evolution of sequence, structure and function.
We welcome abstracts that tackle reconstructing and interpreting protein superfamily trees, including large-scale gene origin and ortholog analysis, novel benchmarking and simulation studies, domain-based phylogenies, reconstructing ancestral proteins, and other issues involved in protein superfamily analysis.
Organizers: Tony Capra, Dannie Durand, Toni Gabaldon, Christine Orengo, Kimmen Sjolander, Maureen Stolzer.


26. Mutation: The Ultimate Source of Molecular Variation
As biologists, we are interested in explaining variation at multiple levels: what makes species different, what makes individuals different, what makes genes different, etc. One cannot study such variation without understanding mutation. Mutations can result from errors during replication, mistakes during recombination, or from environmental factors. Although there are mutation rates estimated for taxa, mutation rates vary across the genome, and may vary with age. While point-mutations have been well studied, more complex events like indels or segmental-duplications have not.
Mutations are of great interest for studying population history and mechanisms of evolution. Mutations are also widely sought after as the causes of disease and phenotypic variation. However, mutations do not occur in a vacuum; genetic background influences mutations’ impact on fitness. In this symposium, we bring together researchers who study the origin and impact of mutations, including the influence of mutation patterns on evolution and disease.
Organizers: Reed A. Cartwright, Melissa A. Wilson Sayres


35. The Origin and Evolution of Early Life
Major transitions during the early history of life pose several challenges, from the origin of genetic systems and protein translation, to the core components of all cells and the last universal common ancestor (LUCA), followed by the radiation of the major cellular lineages. This symposium will explore key events during early evolution and the methodological advances that are bringing new insights to this important, but still poorly understood, period of evolutionary history.
The topics to be considered will include the evolution and optimization of the genetic code and the molecules that implement it; the nature, genome and metabolic potential of LUCA; the relationships among the major cellular lineages, and the evolutionary processes underlying their diversification. Recent developments in phylogenetic modeling, ancestral sequence reconstruction, gene tree/species tree reconciliation, and phylogenetic networks all promise to shed new light on early evolution, and this symposium will welcome these and other approaches to these fascinating and enduring problems.
Organizers: Tom Williams, Martin Embley, Steven E. Massey, Aaron Goldman.


36. The role of epistasis in molecular evolution
Epistasis (nonadditive interactions between mutations) can influence the rate and direction of evolutionary change, and is therefore of longstanding interest to evolutionary geneticists. In molecular evolution, insights into the form and prevalence of epistasis are relevant to fundamental questions about the topography of adaptive landscapes and the predictability of mutational pathways through sequence space. In recent years, microbial experimental evolution studies have demonstrated how epistasis shapes the structure of the genotype-fitness map, and directed-mutagenesis studies have documented epistasis between mutant sites in the same protein, revealing the direct causes of genetic constraints on adaptation and shedding light on the selective accessibility of alternative mutational paths to high-fitness genotypes.
Within the past couple of years, a number of high-profile papers have opened up fresh debates about the role of epistasis in molecular evolution (e.g., Breen et al. [2012] Nature 490: 535-538; McCandlish et al. [2013] Nature 497:E1-E2). It is clear that a symposium on this topic would be very timely. The purpose of the proposed symposium is to showcase recent theoretical and empirical advances in our understanding of epistasis and its influence on evolutionary mechanism and process. Our aim is to showcase work that tackles big questions and motivates new research directions.
Organizers: Jay F. Storz, Kristi L. Montooth

Wednesday, 20 November 2013

How evolutionary biologists can help biologists from other fields (especially the medical side).

In a recent post, Dan Graur made various criticisms of a paper published last year in PLoS ONE:

and here is the link to the paper:

I like the concept of this paper, which combines various bioinformatics methods to identify evolutionary pressures on sites and tries to correlate them with disease mutations (see also our review on the subject, http://www.biochemj.org/bj/449/bj4490581.htm, and this recent one, www.sciencedirect.com/science/article/pii/S0022283613004464#).

I have other comments (more like recommendations) that I would like to share on this blog:

- They used an alignment from ClustalW, which is surprising as it is the oldest of these methods. Better methods have been published since then: one can cite Muscle, Mafft, Probcons, T-coffee and Clustal-Omega. The authors compared some of them, but found that ClustalW was the best to use in their study. They could also have tried some phylogeny-aware alignment methods (PRANK, PAGAN), to see if it makes a difference or not.

Recommendation number 1: try and/or use different alignment methods that have been proven to be robust.
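
One simple way to see whether the choice of aligner matters is to compare which residues end up in the same column in two alignments of the same sequences. The sketch below is only an illustration: the function names, the Jaccard-style score and the toy sequences are my own assumptions, not something taken from the paper or from any of the aligners.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    `alignment` maps a sequence name to its gapped string ('-' for gaps).
    A residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    counters = {n: 0 for n in names}
    pairs = set()
    for col in range(length):
        residues = []
        for n in names:
            if alignment[n][col] != "-":
                residues.append((n, counters[n]))
                counters[n] += 1
        # every pair of residues sharing this column is "aligned"
        pairs.update(combinations(residues, 2))
    return pairs


def pair_agreement(aln_a, aln_b):
    """Fraction of aligned residue pairs shared by two alignments (Jaccard index)."""
    pairs_a, pairs_b = aligned_pairs(aln_a), aligned_pairs(aln_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Toy, made-up alignments of the same three sequences by two hypothetical runs
aln_run1 = {"human": "MKAL-IV", "mouse": "MKALGIV", "fish": "MR-LGIV"}
aln_run2 = {"human": "MKAL-IV", "mouse": "MKALGIV", "fish": "MRL-GIV"}
print(round(pair_agreement(aln_run1, aln_run2), 3))
```

A score of 1 means both alignments pair exactly the same residues; lower values point to regions worth inspecting by eye (for example in Jalview) before trusting any downstream result.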


- There is no indication of how they built the tree (NJ? ML? Bayes?). I tried to produce a tree, but I got the same results, with the Rat/Mouse clade also badly placed. These two sequences are highly divergent compared to the other sequences, so it is no surprise that even sophisticated algorithms produce strange results.

Recommendation number 2: while fast methods such as Minimum Evolution (FastME) and FastTree (which starts from a Neighbour-Joining-like tree) can provide good results, always try and/or use different methods to build a phylogenetic tree, especially Maximum Likelihood (PhyML, RAxML) and Bayesian inference (MrBayes).

Recommendation number 3: always indicate all the details of the methods used: name of the program, release used, parameters used (other than default).
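
A lightweight way to follow this recommendation is to keep a small machine-readable log next to the results. The sketch below is only a suggestion: the field names are my own, and the version strings and parameters are placeholders, not the ones used in the paper.

```python
import json


def method_record(step, program, version, parameters):
    """Describe one analysis step so that it can be reported and reproduced."""
    return {
        "step": step,
        "program": program,
        "version": version,        # release actually used
        "parameters": parameters,  # only the non-default settings
    }


# Placeholder values: versions ("x.y") and options are illustrative only.
methods = [
    method_record("alignment", "MAFFT", "x.y", {"strategy": "L-INS-i"}),
    method_record("tree", "PhyML", "x.y", {"model": "LG"}),
]
print(json.dumps(methods, indent=2))
```

The resulting JSON can be pasted into the supplementary material, so that a reader knows exactly which program, release and options produced each figure.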

 
- They could have rooted the tree with the teleost fishes:



This is not wrong, but it would be much better to present it as rooted.

Recommendation number 4: always try to present your results in the light of evolution.
 

- The methods they used to analyse conservation/functional divergence are sensitive to the quality of the input, but I think they can tolerate a few topological errors. For example, Pupko and Galtier found that their algorithm gives the same result with different trees:
http://rspb.royalsocietypublishing.org/content/269/1498/1313.short
"Several controls were performed to check the robustness of these results. Results were essentially unchanged when a different phylogenetic tree (Reyes et al. 2000) was used (not shown)."
As the algorithms used here (e.g. DIVERGE) are similar, I would not be surprised to see a similar result with the taxonomic tree. Of course, as a reviewer, I would ask the authors to use the two alternative topologies: the taxonomic tree and the gene tree.

- The sheep is badly placed, but considering the short branch, this is not really a surprise. And again, I don't think this will greatly influence the final result.

- However, my main worry is the power of their analysis, and what they tried to do. For example, they used DIVERGE like this: "DIVERGE site-specific evolutionary constraint values were computed using the depth 4 (vertebrate) data set only. DIVERGE was run by splitting the vertebrate phylogeny at the deepest node separating the fish from the terrestrial vertebrates, and Type II divergence values were recorded." So they are testing whether any sites are under functional divergence between fishes (3 species) and tetrapods (15 species).

There are three problems here:
1) They only analysed the deepest evolutionary event. If any functional divergence happened later (e.g. in mammals), it could not be detected.
2) They used a clearly unbalanced dataset (3 versus 15).
3) With 3 species on one side, they clearly have no power and no accuracy. The pattern of conservation (or divergence) can just be noise. Normally, at least four species are needed in each clade to see a significant pattern.
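
Problems 2 and 3 are easy to check before running any functional divergence analysis. Here is a minimal sketch, assuming we only know the number of species in each clade; the minimum of four species comes from the rule of thumb above, while the maximum size ratio of 3 is an arbitrary value I picked for illustration.

```python
def clade_warnings(clade_sizes, min_size=4, max_ratio=3.0):
    """Return a list of warnings for a comparison between clades.

    min_size follows the 'at least four species per clade' rule of thumb;
    max_ratio is an arbitrary, illustrative threshold for imbalance.
    """
    warnings = []
    for name, size in clade_sizes.items():
        if size < min_size:
            warnings.append(f"{name}: only {size} species (minimum {min_size})")
    smallest = min(clade_sizes.values())
    largest = max(clade_sizes.values())
    if smallest > 0 and largest / smallest > max_ratio:
        warnings.append(f"unbalanced comparison: {smallest} versus {largest} species")
    return warnings


# Clade sizes quoted above: 3 fishes versus 15 tetrapods
for warning in clade_warnings({"fishes": 3, "tetrapods": 15}):
    print(warning)
```

With the sizes quoted above, both checks fail: the fish clade is too small and the comparison is unbalanced.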

Recommendation number 5: always try to have a significant amount of data in order to have enough power. We are in the genomic era, so a wealth of sequences is available. For example, thousands of sequences provide enough power to predict 3D structures.

Recommendation number 6, and the most important: always try to understand what you are doing, and especially check whether the methods you are using are appropriate for your data. Most evolutionary bioinformatics tools are not straightforward, and some are quite complex. For example, I spent a significant part of my PhD understanding and implementing CodeML/PAML. While I am quite confident with it, I think I always have things to learn in this field.


=> If you are not familiar with evolutionary bioinformatics tools, ask a colleague or an expert. They will be more than happy to help.