Friday, 27 November 2015

Cleaning Python scripts with Pylint and GNU/Expand

Python is a wonderful programming language. But it can be quite syntax error tolerant. For example, the indentation is really important, but you can use either tabulations or spaces. You can also mix them in Python 2 (forbidden in Python 3).

These days, I am trying to stick to the rules (I am getting old). For that, there is the PEP8 style guide: https://www.python.org/dev/peps/pep-0008/

For example, the official format for identation is 4 spaces per indentation level.

I found Pylint, which check your code for such errors: http://www.pylint.org/

It gives a list of all erors line by line, and a global score.

Let's give a try. Here is a code to print fibonacci numbers, which you can download it from here: https://drive.google.com/open?id=0BxcXpZeUylGbSTNCZDVCQmRQZTA

Save it as fibo.py and launch it:

python fibo.py 9
0 1
1 1
1 2
2 3
3 5
5 8
8 13
13 21
(9, 21)

It works! Great!

But let's have a look with Pylint:


Global evaluation
-----------------
Your code has been rated at 0.83/10

Ouch, this hurts!

Many problems apparently:

************* Module fibo
C:  9, 0: Exactly one space required after comma
    i,j = 1,0
     ^ (bad-whitespace)
C:  9, 0: Exactly one space required after comma
    i,j = 1,0
           ^ (bad-whitespace)
C: 11, 0: Exactly one space required after comma
    for k in range(1,n + 1):
                    ^ (bad-whitespace)
W: 12, 0: Found indentation with tabs instead of spaces (mixed-indentation)
C: 12, 0: Exactly one space required after comma
        i,j = j, i + j
      ^ (bad-whitespace)
C: 13, 0: Trailing whitespace (trailing-whitespace)
W: 13, 0: Found indentation with tabs instead of spaces (mixed-indentation)
C: 13, 0: Exactly one space required after comma
        print i,j
            ^ (bad-whitespace)
C: 15, 0: Trailing whitespace (trailing-whitespace)
W: 11, 8: Unused variable 'k' (unused-variable)
W:  3, 0: Unused import os (unused-import)

First problem, there is a mix and match with tab and space. It is easy to manually correct a small file, it can be tough with a very big file. GNU/UNIX provides some tools for that:

1) command cat 

https://www.gnu.org/software/coreutils/manual/html_node/cat-invocation.html
 
With Linux: cat -A fibo.py
With MacOSX: cat -e -t  fibo.py
With Windows: guru meditation error, sorry.

=> space will be displayed as space, but tab will now be displayed as "^I". Easier to see any problem.

2) command expand

http://www.gnu.org/software/coreutils/manual/html_node/expand-invocation.html
 
expand -t 4 fibo.py > tmp.txt  # Change all tab to 4-space
mv tmp.txt fibo.py  # move back the filename.
chmod +x fibo.py  # make it executable again.


And now try again
cat -e -t  fibo.py
pylint fibo.py

Global evaluation
-----------------
Your code has been rated at 2.50/10 (previous run: 0.83/10, +1.67)


Good, there is some progresses.


C:  9, 0: Exactly one space required after comma
    i,j = 1,0
     ^ (bad-whitespace)
C:  9, 0: Exactly one space required after comma
    i,j = 1,0
           ^ (bad-whitespace)
C: 11, 0: Exactly one space required after comma
    for k in range(1,n + 1):
                    ^ (bad-whitespace)
C: 12, 0: Exactly one space required after comma
        i,j = j, i + j
         ^ (bad-whitespace)
C: 13, 0: Trailing whitespace (trailing-whitespace)
C: 13, 0: Exactly one space required after comma
        print i,j
               ^ (bad-whitespace)
C: 15, 0: Trailing whitespace (trailing-whitespace)
W: 11, 8: Unused variable 'k' (unused-variable)
W:  3, 0: Unused import os (unused-import)

The rest of the code is more styling errors:
- Add space after the comma.
- Remove import os
- Remove trailing white-space (visible with cat -A / cat -e -t)

And now:
Your code has been rated at 9.09/10.
Much better!

The last problem is an unused variable k. We can let it like this or change it with a more elegant way (i.e. with a while loop).


So, recommandation:

- Do your code properly since the begining.
- Use Pylint to identify errors.
- Correct the errors.
- Have a look to see if your code is not changed (i.e. identation shifted).
- Run your code again to check if it is working.


PS: It looks like Emacs 25.0 is properly handling the tab key, aka adding 4-spaces in the file instead of a tab.



Wednesday, 25 November 2015

Generate a PDF version of the Google Scholar citation histogram

Google Scholar provides a nice way to illustrate your citation records as histogram:

I wanted to add it to my publication record document, but the problem is we cannot save as a picture. We could save as a screenshot but as bitmap, it will look blurry. Hopefully, I find a nice way to reproduce the plot with R (https://www.r-project.org/) and ggplot2 (http://ggplot2.org/).

First, create a simple file called "google_record.txt", which contains the following information:
2015 208
2014 159
2013 100
2012 58
2011 44
2010 30
2009 22
These numbers are simply obtained by rolling the mouse over the plot in the webpage of google scholar. Of course, don't forget to change the values by your numbers. ;)

Then, launch R and type the following commands:
if (!require("ggplot2")) {
   install.packages("ggplot2", dependencies = TRUE)
   library(ggplot2)
}

df <- read.table("google_record.txt")

ggplot(df,aes(x=factor(V1), y=V2))+geom_bar(stat = "identity", fill="gray40")+
    theme(panel.background= element_blank())+
    theme(panel.grid.major.x = element_blank())+
    theme(panel.grid.major.y = element_line(size=0.5, color="grey"))+
    theme(panel.grid.minor.y = element_line(size=0.5, color="grey"))+
    theme(axis.ticks = element_line(size = 0.5, colour="grey"))+
    theme(axis.text = element_text(colour="black", size=rel(1.2)))+
    labs(title = "Citations per year", x="", y="", size=rel(1.3))

ggsave("google_record.pdf",height=3,width=5)

You will then obtain a clean PDF file to include in your document.

Wednesday, 25 March 2015

Tutorial: estimating the stability effect of a mutation with FoldX

Introduction:

Here is a brief tutorial on how to use FoldX to estimate the stability effect of a mutation in a 3D structure. The stability (ΔG) of a protein is defined by the free energy, which is express in kcal/mol. The lower it is, the more stable it is. G is difference of free energy between a wild-type and mutant. A mutation that brings energy (ΔΔG>kcal/mol) will destabilise the structure, while a mutation that remove energy (ΔΔG<kcal/mol) will stabilise the structure. A common threshold is to say that a mutation has a significant effect if ΔΔG is >1 kcal/mol, which roughly corresponds to a single hydrogen bond.

A way to compute the free energy is to use molecular dynamics. Main problem: it can be very time-consuming.

FoldX uses an empirical method to estimate the stability effect of a mutation. The executable is available here: http://foldx.crg.es/

NB: I strongly encourage to read the manual (before or in parallel of this tutorial).

Foldx was used in many studies, i.e.:
Tokuriki N, Stricher F, Serrano L, Tawfik DS. How protein stability and new functions trade off. PLoS Comput Biol. 2008 Feb 29;4(2):e1000002 http://dx.doi.org/10.1371/journal.pcbi.1000002

Dasmeh P, Serohijos AW, Kepp KP, Shakhnovich EI. Positively selected sites in cetacean myoglobins contribute to protein stability. PLoS Comput Biol. 2013;9(3):e1002929. http://dx.doi.org/10.1371/journal.pcbi.1002929
And I personally used it in three of my studies:
Studer RA, Christin PA, Williams MA, Orengo CA. Stability-activity tradeoffs constrain the adaptive evolution of RubisCO. Proc Natl Acad Sci U S A. 2014 Feb 11;111(6):2223-8. http://dx.doi.org/10.1073/pnas.1310811111
Studer RA, Opperdoes FR, Nicolaes GA, Mulder AB, Mulder R. Understanding the functional difference between growth arrest-specific protein 6 and protein S: an evolutionary approach. Open Biol. 2014 Oct;4(10). pii: 140121. http://dx.doi.org/10.1098/rsob.140121

Rallapalli PM, Orengo CA, Studer RA, Perkins SJ. Positive selection during the evolution of the blood coagulation factors in the context of their disease-causing mutations. Mol Biol Evol. 2014 Nov;31(11):3040-56. http://dx.doi.org/10.1093/molbev/msu248

Example:

The structure is a bacterial cytochrome P450 (PDB:4TVF). You can download it PDB file (4TVF.pdb) from here: http://www.rcsb.org/pdb/explore.do?structureId=4TVF

We would to test the stability of mutatinf the leucine (L) at position 280 to an aspartic acid (D).

Here is the original structure, with Leu280 in green, and residues around 6Å in yellow:



FoldX has different modes to run it, but I use the mode "runfile", which contains COMMANDS and OPTIONS in one single file.


FoldX works in two steps:


1) Repair the structure.

There are frequently problems in PDB structures, like steric clashes. FoldX will try to fix them and lower the global energy (ΔG). The command "RepairPDB" is better than the "Optimize" command. Here is the command file "foldx_repair.txt":
<TITLE>FOLDX_runscript;
<JOBSTART>#;
<PDBS>4TVF.pdb;
<BATCH>#;
<COMMANDS>FOLDX_commandfile;
<RepairPDB>#;
<END>#;
<OPTIONS>FOLDX_optionfile;
<Temperature>298;
<R>#;
<pH>7;
<IonStrength>0.050;
<water>-CRYSTAL;
<metal>-CRYSTAL;
<VdWDesign>2;
<OutPDB>true;
<pdb_hydrogens>false;
<END>#;
<JOBEND>#;
<ENDFILE>#;
We indicate which PDB file it needs to use, that we want to repair it (<RepairPDB>), that it will use water and metal bonds from the PDB file (<water>-CRYSTAL; <metal>-CRYSTAL;) and that we want a PDB as output (<OutPDB>true).

You have to run the command file "foldx_repair.txt" like this:
foldx3b6 -runfile foldx_repair.txt
This process is quite long (around 10 minutes). Here is the result (the original structure is now in white, while the repaired structure is in yellow/green):

 We can see that some side chains have slightly move (in particular Phe16).

The starting free energy ΔG was 64.99 kcal/mol, and it was lowered to -48.15 kcal/mol, which is now stable (remember that a "+" sign means unstable while a "-" sign means stable).

Once it's finished, it will produce a file named "RepairPDB_4TVF.pdb", which you will use in the next step.


2) Perform the mutation

The mutation itself is perform by the BuildModel function. There are other methods, but the BuildModel is the most robust. You need also to specify the mutation in a separate file "individual_list.txt".

In the command file, you will see that is RepairPDB_4K33.pdb and not 4K33.pdb that is mutated. You will also notice "<numberOfRuns>3;". This is because some residues can have many rotamers and could have some convergence problems. You may to increase this values to 5 or 10, in case you are mutated long residues (i.e. Arginine) that have many rotamers.

Here the command file "foldx_build.txt":
<TITLE>FOLDX_runscript;
<JOBSTART>#;
<PDBS>RepairPDB_4TVF.pdb;
<BATCH>#;
<COMMANDS>FOLDX_commandfile;
<BuildModel>#,individual_list.txt;
<END>#;
<OPTIONS>FOLDX_optionfile;
<Temperature>298;
<R>#;
<pH>7;
<IonStrength>0.050;
<water>-CRYSTAL;
<metal>-CRYSTAL;
<VdWDesign>2;
<OutPDB>true;
<numberOfRuns>3;
<END>#;
<JOBEND>#;
<ENDFILE>#;
and the "individual_list.txt" (just one line):
LA280D;
It contains the starting amino acid (L), the chain (A), the position (280) and the amino acid you want at the end (D). One line correspond to one mutant. It means you can mutate many residues at the same per line (mutant) and also  produce different mutants by different numbers of lines.

You can run it by:
foldx3b6 -runfile foldx_build.txt
It is much faster this time (i.e. a few seconds) and will produce many files.

FoldX will first mutate the target residue (L) to itself (L) and move it as well as all neighbouring side chains multiple times. We can see that Leu280 (green) was rotated:

=> This is will give the free energy of the wild-type (let's call it ΔGwt).

Then, it will mutate the target residue (L) to the desired mutant (D) and move it as well as all neighbouring side chains multiple times. We can see that Leu280 is mutated to Asp280 (see the two oxygen atoms in red):


=> This is will give the free energy of the mutant (let's call it ΔGmut).



The difference in free energy (ΔΔG) is given by ΔGmut-ΔGwt.

In the file "Raw_BuildModel_RepairPDB_4TVF.fxout", you can retrieve the energy of the three runs for both WT and Mutant.

Run1:
  • ΔGmut = RepairPDB_4TVF_1_0.pdb = -41.1377 kcal/mol
  • ΔGwt = WT_RepairPDB_4TVF_1_0.pdb = -46.0464 kcal/mol
  • => ΔΔG = ΔGmut-ΔGwt = (-41.1377)-(-46.0464) = +4.9087 kcal/mol

One file contains the average difference over all runs: "Average_BuildModel_RepairPDB_4K33.fxout".
You will notice that the difference in free energy ΔΔG is +4.84 kcal/mol (+- 0.06 kcal/mol).

=> It means the mutation L280D is highly destabilising (positive value, and much above 1.0 kcal/mol). Here is the final mutant:


PS: Another way to define the threshold is to use the SD deviation multiple times:

The reported accuracy of FoldX is 0.46 kcal/mol (i.e., the SD of the difference
between ΔΔGs calculated by FoldX and the experimental values). We can bin the ΔΔG values into seven categories:
  1. highly stabilising (ΔΔG < −1.84 kcal/mol); 
  2. stabilising (−1.84 kcal/mol ≤ ΔΔG < −0.92 kcal/mol); 
  3. slightly stabilising (−0.92 kcal/mol ≤ ΔΔG < −0.46 kcal/mol); 
  4. neutral (−0.46 kcal/mol < ΔΔG ≤ +0.46 kcal/mol);
  5. slightly destabilising (+0.46 kcal/mol < ΔΔG ≤ +0.92 kcal/mol);
  6. destabilising (+0.92 kcal/mol < ΔΔG ≤ +1.84 kcal/mol);
  7. highly destabilising (ΔΔG > +1.84 kcal/mol).



Monday, 15 December 2014

Symposium on Protein Evolution and Structure - ESEB2015 - Lausanne Aug10-14

Dear colleagues,


We are delighted to invite submissions for the ESEB 2015 symposium, in Lausanne, Switzerland.
 
 
 "PROTEIN EVOLUTION: STRUCTURAL AND FUNCTIONAL PERSPECTIVE".

The symposium will focus on protein evolution in broad terms, including protein conservation and adaptation, detection of positive selection and co-evolution, structural evolution and stability constraints. We welcome submissions integrating studies on protein evolution with biochemistry and functional/structural genomics.

The symposium will take place during the 15th Congress of the European Society for Evolutionary Biology (ESEB) on 10 - 14 August 2015 in Lausanne, Switzerland.

REGISTRATION:
    Deadline: 10 January 2015
    http://www3.unil.ch/wpmu/eseb2015

INVITED SPEAKERS:
    Dan Tawfik, Weizmann Institute of Science, Israel
    Richard Goldstein, University College London, UK

SYMPOSIUM DESCRIPTION:
Proteins evolve by the replacement of amino acids (substitutions) or the insertion/deletion of fragments (indels). For the protein, mutations may be deleterious or beneficial, governed by the laws of natural selection. Beneficial mutations increase the fitness of the phenotype and are more likely to become fixed in the genome (positive selection). Proteins are not robust to drastic changes (i.e. important changes in stability) and mutations that favour an adaptive functional change are generally accompanied by other coevolving mutations that insure the integrity of the 3D structure (compensatory effect). All these biophysical properties are paving the way for protein evolution.
Traditionally, there was little if any crosstalk between the fields of protein biophysics, protein structure-function and molecular evolution. The last several years have seen some exciting development in merging these areas to obtain an in-depth understanding of how proteins evolve.
For example, a better understanding of how structural constraints affect protein evolution will greatly help to optimise stochastic models of sequence evolution. The symposium aims at exploring this new synthesis.

Abstracts will be selected for presentation by early March. When submitting your abstract please state your preference (talk, poster) during the submission process.
With questions please contact the symposium organizers:

*    Romain Studer (rstuder@ebi.ac.uk)
*    Maria Anisimova (maria.anisimova@zhaw.ch)

We are looking forward to seeing you at ESEB 2015!

Romain and Maria


_________________________________

Romain Studer
European Bioinformatics Institute
Wellcome Trust Genome Campus, UK

Twitter: @RomainStuder
Blog: http://evosite3d.blogspot.co.uk/

Maria Anisimova
Institute of Applied Simulations
Zurich University of Applied Sciences
Switzerland

Evolutionary Genomics:
Vol 1: http://www.springer.com/biomed/human+genetics/book/978-1-61779-581-7
Vol 2: http://www.springer.com/biomed/human+genetics/book/978-1-61779-584-8

Thursday, 25 September 2014

Tutorial on Ancestral Sequence Reconstruction

This tutorial was part of a course on protein evolution done during ECCB 2014 in Strasbourg: http://www.eccb14.org/program/tutorials/pea


NB: If there are any question, comments or bugs, feel free to ask. ;)



Introduction

The slides of the introduction are available here:
ancestral_sequence_reconstruction.pdf


In this practical, you will learn how to prepare files for CodeML, how to use it for reconstructing ancestral sequences and how to compute the isoelectric point of protein sequences. There are a few scripts you will need to download during the practical.

If you need to make these scripts executable, use chmod:
chmod +x script.py


Tools used in this practical:

# Libraries for Python
Biopython: http://biopython.org/wiki/Main_Page

# Package for ancestral sequence reconstruction (and many other things)
CodeML from PAML: http://abacus.gene.ucl.ac.uk/software/paml.html

# Alignment tools
MAFFT L-INS-i: http://mafft.cbrc.jp/alignment/software/
Clustal-Omega: http://www.clustal.org/omega/
(Clustal-Omega is the new aligner from ClustalW team, but much faster and more accurate)

# Alignment visualisation
Jalview: http://www.jalview.org/

# Phylogenetics tools
PhyML: http://www.atgc-montpellier.fr/phyml/binaries.php
FastTree: http://www.microbesonline.org/fasttree/

# Tree visualisation
NJplot: http://doua.prabi.fr/software/njplot



This practical will focus on the lysozyme, an enzyme (EC 3.2.1.17) that damages bacterial cell walls. The Uniprot page is: http://www.uniprot.org/uniprot/P61626
They evolved differently in Primates:

















 

 

 

 

 

 

Step 1: Prepare alignment.

The quality of the ancestral reconstruction will heavily depend on the quality of the alignment and the tree topology (branch lengths are re-estimated during the reconstruction).

Please download the sequence file: lysozyme_primates.seq

Make a multiple alignment, either with Mafft-L-INS-i or Clustal-Omega:

mafft-linsi lysozyme_primates.seq > lysozyme_primates.fasta
or
clustal-omega-1.2.0-macosx --in lysozyme_primates.seq --out lysozyme_primates.fasta

(Please have a look at the alignment in Jalview)


The format of the resulting alignment is FASTA. However, most phylogenetic softwares use PHYLIP format. So, you have to convert it into PHYLIP.
Download the script  "convert_fasta2phylip.py" and execute it:

convert_fasta2phylip.py lysozyme_primates.fasta lysozyme_primates.phy

(If you are not familiar, have a look at the differences between the alignment in FASTA and PHYLIP formats).


Step 2: Prepare alignment.

We can now generate a tree, either with PhyML (one of the most accurate tool) or FastTree (very fast and pretty accurate):

phyml -i lysozyme_primates.phy -d aa -m JTT -c 4 -a e -b 0
mv lysozyme_primates.phy_phyml_tree.txt lysozyme_primates.tree

Option used:
-i = input file
-d aa: amino acid sequences
-m JTT: (substitution matrix). JTT works fine for most proteins, but other matrices (WAG, LG) can do slightly better.
-c 4: (numbers of categories for the gamma distribution)
-a e: (estimate alpha parameter for the gamma distribution)
-b 0: (we don't want boostrap, as this will cause trouble for further analyses in CodeML).


or run:

FastTree -nosupport lysozyme_primates.phy > lysozyme_primates.tree

Option used:
-nosupport:(we don't want boostrap, as this will cause trouble for further analyses in CodeML).


Finally, we could root the tree. Use NJplot and root it by the group containing the Marmoset sequence (Callithrix jacchus).

Save it as "lysozyme_primates_rooted.tree"


Step 3: Run ancestral sequence reconstruction.


The ancestral sequence reconstruction is done by CodeML, from the PAML package.

It is launched with "codeml control_file.ctl"

You may have to copy the file "jones.dat" from the dat folder in the PAML package, or indicate its location.



The control file contains many parameters:

      seqfile = lysozyme_primates.phy    * sequence data filename
     treefile = lysozyme_primates_root.tree   * tree structure file name
      outfile = lysozyme_primates.mlc    * main result file name

        noisy = 9  * 0,1,2,3,9: how much rubbish on the screen
      verbose = 2  * 0: concise; 1: detailed, 2: too much
      runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

      seqtype = 2  * 1:codons; 2:AAs; 3:codons-->AAs

        clock = 0  * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
       aaDist = 0  * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
   aaRatefile = ./jones.dat  * only used for aa seqs with model=empirical(_F)

                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own

        model = 2
                   * models for codons:
                       * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                       * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                       * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)

        icode = 0  * 0:universal code; 1:mammalian mt; 2-10:see below
        Mgene = 0     * codon: 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff            
                   * AA: 0:rates, 1:separate

    fix_alpha = 0   * 0: estimate gamma shape parameter; 1: fix it at alpha
        alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 1   * different alphas for genes
        ncatG = 4   * # of categories in dG of NSsites models


        getSE = 0  * 0: don't want them, 1: want S.E.s of estimates

 RateAncestor = 1  * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)

   Small_Diff = .5e-6
    cleandata = 0  * remove sites with ambiguity data (1:yes, 0:no)?
       method = 1  * Optimization method 0: simultaneous; 1: one branch a time


* Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt.,
* 4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt.,
* 7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt.,
* 10: blepharisma nu.
* These codes correspond to transl_table 1 to 11 of GENEBANK.


Explanation of some parameters:
runmode = 0 => We provide the tree.
clock = 0 => We don't set a molecular clock. We assume that the genes are evolving at different rate.
aaDist = 0 => We don't use the physicochemical properties of the amino acid.
aaRatefile = ./jones.dat => We use the JTT matrix. Other matrix could be used (WAG, etc...)
model = 2 => We use an empirical model (= substitutions matrix such as JTT).
fix_alpha = 0 => We estimated the alpha parameter of the gamma distribution.
alpha = 0.5 => We start the estimation from 0.5
RateAncestor = 1 => Force the estimation of ancestral states.
cleandata = 0 => Keep all ambigous data ("-", "X").



Please have a look at the output.

CodeML will also write into many files, but only two are of interest here:

  • lysozyme_primates.mlc => Contains many information on evolutionary rates.
  • rst => Contains ancestral states for sites and for nodes.
Please have a look at both files.


In the rst file, there is also how the tree has been annotated, under the line "tree with node labels for Rod Page's TreeView".
You can copy this tree into a file (i.e. "lysozyme_primates_annotated.tree"), open it with NJplot, and display the "bootstrap values".
We can see at which node correspond which ancestral sequence.


Questions:
- What is the estimated alpha parameter of the gamma curve?
- How many categories where used? And what are their frequencies?
- In Jalview, we observed at column 68 a mixture of E(Glu), Q(Gln) and R(Arg). But what was the most likely state of this position in the last common ancestor of all sequences? What are the probabilities of A(AlA), E(Glu), Q(Gln) and R(Arg)?
- What are the evolutionary events (amino acid substitutions) at the basis of the Hominoidea clade (Hylobates lar, Gorilla, Pan Paniscus, Homo sapiens)?


Now it is time to extract ancestral sequences and put them in a file. The rst file is quite difficult to parse, hopefully, each ancestral sequence start by "node".

Download the following script and execute it: parse_rst.py

./parse_rst.py rst

It displays ancestral sequences in FASTA format. Let's put them in a file:

./parse_rst.py rst > ancestral_sequences.fasta



Part 4: Compute physico-chemical properties on ancestral sequences.


In Biopython, there is a function to compute the isoelectric point (pI):

analysed_protein = ProtParam.ProteinAnalysis(sequence)
pI = analysed_protein.isoelectric_point()


The following script will compute the pI for all sequences in a FASTA file: compute_pI.py

By launching it, we can retrieve the pI for modern primate lysozymes:

./compute_pI.py lysozyme_primates.fasta

And similarly for ancestral sequences:

./compute_pI.py ancestral_sequences.fasta



Part 5: Map properties on tree.

We could easily map ancestral properties on the tree. The tree provided in rst contains nodes where bootstrap information is.
We just need to change the values of these nodes by the corresponding pI.

Download the following script and execute it: map_on_tree.py

./map_on_tree.py ancestral_sequences.fasta lysozyme_primates_annotated.tree >  lysozyme_primates_annotated_pI.tree

Have a look at both trees in a text editor.

You can load it in NJplot and see the different pI at the bootstrap place.


Alternatively, you can install FigTree: http://tree.bio.ed.ac.uk/software/figtree/

and load the tree.

In FigTree, set the following parameters:
Appearance->Colour by:label
Setup: Colours  -> Scheme: Colour gradient
Tick gradient
Line Weight 4

Wednesday, 11 June 2014

Is there any example of study on amino acid sites under positive selection (dN/dS) that have been tested in vitro (confirmatory or not)?

(This post is more a open question than a direct answer).


Is there any example of study on amino acid sites under positive selection (dN/dS) that have been tested in vitro (confirmatory or not)?

By dN/dS, I mean any amino acid sites in any species that have been detected with CodeML, either the site models (M2a, M8), or the branch-site model. Or any other similar methods that aim to identify sites that could favour adaptation during evolutionary course.

By in vitro, I mean that these sites has been mutated in vitro and tested to see if they provide a different phenotype, or if the targeted protein exhibits different biochemical properties.


===>   I have the impression that there are plenty of studies that identifies such amino acids under positive selection in a large-scale manner, i.e.:

More genes underwent positive selection in chimpanzee evolution than in human evolution
http://www.pnas.org/content/104/18/7489.full

Patterns of Positive Selection in Six Mammalian Genomes
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000144

Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes
http://genome.cshlp.org/content/18/9/1393.short

Patterns of Positive Selection in Seven Ant Genomes
http://mbe.oxfordjournals.org/content/early/2014/05/06/molbev.msu141.abstract



===> There are many studies that analyse these sites in silico by mapping them on the 3D structures, i.e.:

Patterns of Positive Selection in Six Mammalian Genomes
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000144

Adaptive Divergence of Ancient Gene Duplicates in the Avian MHC Class II β
http://mbe.oxfordjournals.org/content/27/10/2360

Evolution of Genes Involved in Gamete Interaction: Evidence for Positive Selection, Duplications and Losses in Vertebrates
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0044548


===> A few of them identifies the effect on the protein stability, i.e.:

Positively Selected Sites in Cetacean Myoglobins Contribute to Protein Stability
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002929

Stability-activity tradeoffs constrain the adaptive evolution of RubisCO
http://www.pnas.org/content/111/6/2223.long


===> But I have difficulties to find studies that actually tested these sites in vitro. I found these ones:

Adaptive evolution of multicolored fluorescent proteins in reef-building corals.
http://link.springer.com/article/10.1007%2Fs00239-005-0129-9

Structural and Functional Evolution of Positively Selected Sites in Pine Glutathione S-Transferase Enzyme Family
http://www.jbc.org/content/288/34/24441.full


But I am wondering if there are many other similar studies?


Thank you for your help!

Sunday, 1 June 2014

ECCB'14 - Tutorial - Protein Evolution Analysis: on the Use of Phylogenetic Trees

Hi all,

Just a small post to announce that me and two of my colleagues from EBI are doing a tutorial on the use of phylogenetics tree in evolutionary analysis (i.e. co-evolution between trees, codeml with biopython, ancestral sequence reconstruction). This will happen during ECCB 2014 in Strasbourg, France, on the Sunday September 7th.

All information available here:




See you there!