You are on page 1of 11

Molecular dynamics with Gromacs

Gromacs is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. Gromacs is available under the GNU public license from www.gromacs.org. There you can find all sorts of documentation, including an extensive Gromacs manual and a mailing list where it is possible to send questions to other Gromacs users. You can also browse the mail list archives to find solutions to your problems. You will follow the instructions below and answer all questions marked with Q: in a separate report file that you should hand in before the next lab. You can work in pairs and hand in the report together, as long as both of you are present at the time of the lab. Should you choose to work individually, we expect you to hand in your own report, which then should be based on your own individual work, and not copied from someone elses report. Before you can start this tutorial you need to make sure you have access to Gromacs. It is already installed (together with xmgrace, see below) but you need to tell the computer where to find it. To do this, copy the bash script GMXRC.bash from AFS (accessible over the konsole in the computerlab) to your Desktop or home directory: /afs/pdc.kth.se/home/c/cschwaig/vol05_Teaching/gromacs-4.5.3-x86-static/bin/ GMXRC.bash and run >> source GMXRC.bash This will add all the programs to your path. To test if the bash script worked, type luck at the command prompt. If Gromacs was loaded successfully, it should reply with a quote (all computations in Gromacs will end with a (more or less) funny quote). In the directory vol05_Teaching you find also, some files that you need later in this lab (/afs/pdc.kth.se/home/c/cschwaig/vol05_Teaching/). Copy them as well.

Background
The purpose of this tutorial is not to master all parts of Gromacs' simulation and analysis tools in detail, but rather to give an overview and "feeling" for the typical steps used in simulations. Since the time available for this exercise is rather limited we can only perform a very short sample simulation, but you will also have access to a slightly longer pre-calculated trajectory for analysis. In principle, the most basic system we could simulate would be water or even a gas like Argon, but to show some of the capabilities of the analysis tools we will

work with a protein, lysozyme. Lysozyme is a fascinating enzyme that has the ability of kill bacteria, and is present e.g. in tears, saliva, and egg white. Alexander Fleming discovered it in 1922, and it is one of the first protein X-ray structures to be determined (David Philips, 1965). It is fairly small by protein standards (although quite large by simulation standards) with 164 residues and a molecular weight of 14.4 kDa. With hydrogen atoms it consists of roughly 2900 atoms (hydrogen atoms are not present in the PDB structure since they only scatter X-rays weakly). In this tutorial, we are first going to set up your Gromacs environment, have a look at the structure, prepare the input files necessary for simulation, solvate the protein model in water, minimize and equilibrate it, and finally perform a short production simulation. After this we will test some simple analysis programs that are part of Gromacs.

The PDB structure


The first thing you need is a structure of the protein you want to simulate. Go to the Protein Data Bank at www.pdb.org, and download the PDB-file for 1LYD. It is always a good idea to look carefully at the structure before you start any simulations. Is the resolution high enough? Are there any missing residues? Are there any strange ligands? Check both the PDB web page and the header in the actual file. Finally, try to open it in a viewer such as PyMOL or VMD and see what it looks like. Q1: Looking at the protein, how many crystal waters, helices and sheets does it have?

Creating a Gromacs topology from the PDB file


Doing a molecular dynamics simulation in Gromacs involves several steps, each using different programs that are part of the Gromacs suite. The first step in this process is to take your PDB file and convert it into a format that Gromacs can use. This is done with a command called pdb2gmx, which as the name suggest, creates Gromacs (gmx)-files from the PDB-file. First, try the command pdb2gmx without doing anything to the PDB-file by typing >> pdb2gmx -h This will print some info about the command on the screen, and you can read about what the command does, and what sorts of input and output files it works with. All Gromacs programs work by specifying input and output files using different flags. For example, for pdb2gmx the name of the input PDB-file is specified with the -f flag. If you specify only a flag without any filename, Gromacs will look for a default name instead.

Q2: Looking at the help text, what is the default name for the input PDB-file for pdb2gmx? Now it is time to prepare the Gromacs input files. This is done with the pdb2gmx command by typing: >> pdb2gmx -f 1LYD.pdb -water tip3p This will take the PDB-file 1LYD.pdb and create a Gromacs configuration file. As input to pdb2gmx we have also specified that we want to use the tip3p water model. This has nothing to do with the protein itself, but rather tells Gromacs how it should represent the water molecules that we are adding later. The program will ask you for a force field - for this tutorial, we suggest using "OPLS- AA/L" (so press 5 followed by enter), and then write out lots of information. Also, pdb2gmx will add all hydrogen atoms to the protein. The important thing to look for is if there were any errors or warnings (it will say at the end). Otherwise you are probably fine. If everything went ok, there should now be some more files in your working directory. The important output files produced by pdb2gmx are conf.gro, topol.top and posre.itp If you issue the command several times, Gromacs will back up old files so their names start with a hash mark (#) - you can safely ignore (or erase) these for now. Now, inspect the output files. In the file conf.gro you will find essentially the same information as in the original PDB-file, the coordinates of all atoms in the system. The file topol.top will contain the topology of the system. The topology is a description of each atom in the system (what type it is etc.) as well as information about the bonds, angles and dihedrals that are present in the system. At the end of this file you will find some additional information, such as the name of the system and the number of protein and solvent molecules. The last file, which is posre.itp, is used when we want to restrain certain atoms in a system and carries information about the magnitude of the restraining force on each atom.

Adding solvent water around the protein


If you started a simulation now, it would be in vacuum. This was actually the norm 25-30 years ago, since it is much faster, but because we want this simulation to be more realistic we should add solvent (water). However, before adding solvent, we have to determine how big the simulation box should be, and what shape to use for it. This is done with the command editconf. The size of the box determines how many water molecules there will be around the protein, and thus will influence the amount of time the simulation

will take. Due to the restricted time we will try and use such a small and efficient box as possible. Q3: Using the help for editconf (editconf -h) to find out what the different choices for box type are. Now use editconf to generate a box. We will use a rhombic dodecahedron box, which is most space efficient. In addition, to reduce the simulation time, we will specify the minimum distance between the protein and any edge of the box to be 0.5 nm, which is a bit small to be fair. >> editconf -f conf.gro -bt dodecahedron -d 0.5 -o box.gro Q4: Use some of the other box types that you found above, and see what effect it has on the volume (the volume is written on the output of editconf) The last step before the simulation is to add water in the box to solvate the protein. Using a small pre-equilibrated system of water coordinates that is repeated over the box achieves this, and overlapping water molecules are then removed. The lysozyme system will require roughly 4000 water molecules, which increases the number of atoms significantly (from about 2900 to over 20000). To put waters around the lysozyme, we use the command genbox >> genbox -cp box.gro -cs spc216.gro -p topol.top -o solvated.gro Solvent coordinates are taken from an SPC water system here, and the -p flag add the new water molecules to the topology as well as to the new coordinate file (solvated.gro). Before you continue, it is a good idea to look at this system in PyMOL. However, PyMOL cannot read the Gromacs gro file directly. Therefore, Gromacs comes with a tool that will produce a PDB file suitable for PyMOL. This tool is called tjrconv: >> trjconv -s solvated.gro -f solvated.gro -o solvated.pdb The command trjconv will introduce you to another important concept, which is the different groups that are defined in the system. By default, there are 13 defined groups. These groups will allow you to single out different parts of the system, so that you can perform e.g. more detailed analyses on them individually. However, we now want to have a look at the protein as it sits in the water box, so select the entire system by selecting 0 in the list. Then load the output PDB-file (solvated.pdb) into PyMOL and have a look at it. You should see the protein sitting in a box of water. Q5: Use PyMOL to export a picture of your simulation box and add it to your report.

Running energy minimization


When we add the water molecules to the system and also add the hydrogen atoms to the lysozyme molecule, these will not be perfectly placed to produce optimal hydrogen bonding geometries, and would give rise to quite large forces

and structure distortion if dynamics simulation was started immediately. To reduce these forces it is necessary to first run a short energy minimization. This will result in a structure that is stable enough for us to run a full simulation later. Energy minimization can be done in lots of ways, and we need to tell Gromacs how we want it done. To this end, we use a special parameter file (called an mdp- file) that will tell Gromacs values of different parameters that we want to use. This file is called em.mdp (em for energy minimisation) here. To get it, either download or make it your self by opening an empty file and type in: integrator = steep nsteps = 200 nstlist = 10 rlist = 1.0 coulombtype = pme rcoulomb = 1.0 vdw-type = cut-off rvdw = 1.0 nstenergy = 10 Do not worry if all these parameters don't make sense to you at the moment. The most important parameter here is integrator = steep which tells Gromacs to perform energy minimization with the steepest descent algorithm, which is a fast and robust algorithm (but not necessarily the best). Other parameters are nsteps = 200 which is the number of steps to be taken, and nstenergy = 10 which is how often we save the result of the energy minimization to the output file. Now, there are many files floating around in your working directory. Is it time to start the actual minimization. Gromacs uses a separate pre-processing program called grompp (for GROMacs PreProcessor) to collect parameters (in the mdp- file), topology (in the top-file) and coordinates (in the gro-file) into a single input file (em.tpr) from which the simulation is started. The necessary command is: >> grompp -f em.mdp -p topol.top -c solvated.gro -o em.tpr Before starting the energy minimization, you should check the output from grompp to make sure that there were no warnings (at least nothing important, you will probably get at least one warning that the protein has a non-zero charge, which you can ignore). Then start the energy minimization with the mdrun command: >> mdrun -v -deffnm em

The program mdrun, that will start the computation, requires lots of input files, but we use the smart shortcut argument -deffnm em, which will tell mdrun to use em as base name for all files but with different endings. The minimization should finish in a couple of minutes, and you should be able to follow the progress on the screen (hopefully the energy (Epot) will decrease).

Equilibrating the water around the protein


To avoid unnecessary distortion of the protein when the molecular dynamics simulation is started, we first need to perform an equilibration run where all the heavy atoms are restrained to their starting positions (using the file posre.itp generated earlier) while the water is relaxing around the structure. This should really be 50-100ps, but due to the time constraints we'll make do with 5ps (2500 steps) here. Other settings are identical to energy minimization, but for molecular dynamics we also need to assign the system temperature and pressure. This is again achieved with a parameter file (pr.mdb) that you can download or create yourself as above. The settings are: integrator = md nsteps = 2500 dt = 0.002 nstlist = 10 rlist = 1.0 coulombtype = pme rcoulomb = 1.0 vdw-type = cut-off rvdw = 1.0 tcoupl = Berendsen tc-grps = protein non-protein tau-t = 0.1 0.1 ref-t = 298 298 Pcoupl = Berendsen tau-p = 1.0 compressibility = 5e-5 ref-p = 1.0

nstxout = 1000 nstvout = 1000 nstxtcout = 100 nstenergy = 100 define = -DPOSRES Here, we use the setting integrator = md to tell Gromacs to do molecular dynamics instead of steepest descent energy minimization. Also, we assign the system to have the temperature 298 K using ref-t = 298 and a pressure of 1 atm by setting tau-p = 1.0. As you can see, it is possible in Gromacs to use different temperature and pressure settings for different parts of the system, which is sometimes needed. For a small protein such as lysozyme it should be more than enough with 100 ps (50000 steps) for the water to fully equilibrate around it, but in a large membrane system the slow lipid motions can require several nanoseconds of relaxation. The only way to know for certain is to watch the potential energy of the system, and extend the equilibration until it has converged. To run this equilibration in Gromacs you execute: >> grompp -f pr.mdp -p topol.top -c em.gro -o pr.tpr >> mdrun -v -deffnm pr If everything is working now, you should see some output on the screen which will tell you when the equilibration is finished (this will probably take around 10 minutes).

Running the production simulation


The difference between equilibration and production run is minimal: the position restraints and pressure coupling are turned off, we decide how often we want to write output coordinates to analyze (say, every 100 steps), and start a significantly longer simulation. How long depends on what you are studying, and that should be decided before starting any simulations! To get a feeling for the time it takes, create a file called run.mdp with the following contents, and then use the commands below to start a very short simulation. integrator = md nsteps = 5000 dt = 0.002 nstlist = 10

rlist = 1.0 coulombtype = pme rcoulomb = 1.0 vdw-type = cut-off rvdw = 1.0 tcoupl = Berendsen tc-grps = protein non-protein tau-t = 0.1 0.1 ref-t = 298 298 nstxout = 1000 nstvout = 1000 nstxtcout = 100 nstenergy = 100 >> grompp -f run.mdp -p topol.top -c pr.gro -o run.tpr >> mdrun -v -deffnm run Q6: How long time did this 5000-step simulation take? How many picoseconds did it run for? A rule of thumb says that your simulation should be at least 10 times longer than the phenomena you are studying, which unfortunately sometimes conflicts with reality and available computer resources. In this tutorial, there is really no time to do a full production run even of our relatively small system. Instead, you can download the files run.xtc (right-click on this file and select "Save as ..". Be careful not to open it in your browser as this is a binary file), run.gro and run.tpr, containing output data from a 10 ns simulation of this system (5 million steps, that took several days).

Analysis of the simulation


You will not likely have time to perform all these analyses, but pick at least two of (a), (b) or (c) that sounds interesting. Then, if you have time, you can either try all the analyses, or you can try to make a movie out of your simulation (d).

(a) Deviation from X-ray structure One of the most important fundamental properties to analyze is whether the protein is stable and close to the experimental structure. The standard way to measure this is the root-mean-square displacement (RMSD) of all heavy atoms with respect to the X-ray structure. Gromacs has a tool called g_rms that does this: >> g_rms -s run.tpr -f run.xtc The program will prompt for two groups to fit, choose "Protein-H" for both groups. The output of the program will be written to rmsd.xvg and you can get a finished graph directly from this file using the program xmgrace. >> xmgr rmsd.xvg& Q7: What is the maximum deviation between the simulated protein and the X- ray structure, and at what time does the protein seem to be stable? Q8: Use xmgrace to produce a picture of the RMSD-plot and include it in your report. Q9: Use the program trjconv to create a PDB-file that is a snapshot of the protein after it has stabilized (check the help text for trjconv if you don't know how to do this, and look at the function of the -dump flag). Take this structure as well as the starting structure, load them into PyMOL and make an alignment between the two. What is the reported rms value in PyMOL? Does it compare with the corresponding Gromacs value? (b) Comparing fluctuations with temperature factors Vibrations around the equilibrium are not random, but depend on local structure flexibility. Different parts of the protein may be very flexible whereas other parts more stiff. Once again there is a program in Gromacs (g_rmsf) that will calculate local flexibility. Use the group C-alpha to get one value per residue: >> g_rmsf -s run.tpr -f run.xtc -o rmsf.xvg Q10: Use xmgrace to visualize the output rmsf.xvg file. In which regions of the protein do you see the largest fluctuations? The root-mean-square fluctuations (RMSF) of each residue are straightforward to calculate over the trajectory, but more important they can be converted to temperature factors that are also present for each atom in a PDB file. These factors (called either temperature factors or b-factors) are a measure of how well the position of each atom was determined. A large factor means that this particular atom was not well defined, and its position is more approximate than for an atom with a low temperature factor.

Run the g_rmsf command once more (but add the flag -oq bfac.pdb) and select the entire protein for analysis. This will give you a PDB-file with the fluctuations written in the file. Q11: Use PyMOL to load the file bfac.pdb and make a picture of this structure colored according to the temperature factor of each residue. Now, its clearer to get a feeling of the flexibility of the different regions of the protein compared to the RMSF-plot in Q10. Did you answer it correctly? Make a picture of this state and include it in your report. Q12: Load the original starting structure into PyMOL and color it according to the crystallization beta-factors. Compare the two structures. Discuss similarities and differences in flexibility between the two structures. (c) Distance and hydrogen bonds With basic properties accurately reproduced, we can use the simulation to analyze more specific details. As an example, lysozyme appears to be stabilized by hydrogen bonds between residues Glu22 and Arg137, so how much does this fluctuate in the simulation, and are the hydrogen bonds intact? To determine this, first create an index file with these groups as: >> make_ndx -f run.gro An index file is a file that tells Gromacs what numbers belong to certain atoms or groups of atoms, enabling them to be singled out for further analysis. Thus, at the prompt, create a group for Glu22 with "r 22" and Arg137 with "r 137", and then "q" to save an index file as index.ndx. The distance and number of hydrogen bonds can now be calculated with the commands: >> g_dist -s run.tpr -f run.xtc -n index.ndx -o dist.xvg >> g_hbond -s run.tpr -f run.xtc -n index.ndx -num hbnum.xvg In both cases you should select the two groups you just created. Q13: Are the hydrogen bonds conserved throughout the trajectory? Q14: Between what atoms are hydrogen bonds normally formed and why? Q15: Open the structure in PyMOL and draw the hydrogen bond(s) that you were looking at. What is the length of the bond? Include the image in your report. (d) Making a movie A normal movie uses roughly 30 frames/second, so a 10 second movie requires 300 simulation trajectory frames. To make a smooth movie the frames should not be more than 1-2 ps apart, or it will just shake nervously. Export a short trajectory from the first 2.5 ns in PDB format (readable by PyMOL) as:

>> trjconv -s run.gro -f run.xtc -e 2500 -o movie.pdb Choose the protein group for output rather than the entire system. If you open this trajectory in PyMol you can immediately play it using the VCR-style controls on the bottom right. Now you are finished with the Gromacs tutorial. Please answer questions 1-6 and the relevant ones in the last section. Hand in the report (by e-mail to cschwaig@cbr.su.se) before ten days from today. Please, keep the report total size < 5 MB.

You might also like