If we are going to understand
epigenetics, we first need to understand a bit about genetics and
genes. The basic code for pretty much all independent life on
earth, from bacteria to elephants, from Japanese knotweed to
humans, is DNA (deoxyribonucleic acid). The term ‘DNA’ has become
an expression in its own right with increasingly vague meanings.
Social commentators may refer to the DNA of a society or of a
corporation, by which they mean the real core of values behind an
organisation. There’s even been a perfume called after it. The
iconic scientific image of the mid-20th century was the atomic
mushroom cloud. The double helix of DNA had similar cachet in the
later part of the same century.
Science is just as prone to mood swings
and fashions as any other human activity. There was a period when
the prevailing orthodoxy seemed to be that the only thing that
mattered was our DNA script, our genetic inheritance. Chapters 1 and 2 showed
that this can’t be the case, as the same script is used differently
depending on its cellular context. The field is now possibly at
risk of swinging a bit too far in the opposite direction, with
hard-line epigeneticists almost minimizing the significance of the
DNA code. The truth is, of course, somewhere in
between.
In the Introduction, we described DNA as
a script. In the theatre, if a script is lousy then even a
wonderful director and a terrific cast won’t be able to create a
great production. On the other hand, we have probably all suffered
through terrible productions of our favourite plays. Even if the
script is perfect, the final outcome can be awful if the
interpretation is poor. In the same way, genetics and epigenetics
work intimately together to create the miracles that are us and
every organic thing around us.
DNA is the fundamental information source
in our cells, their basic blueprint. DNA itself isn’t the real
business end of things, in the sense that it doesn’t carry out all
the thousands of activities required just to keep us alive. That
job is mainly performed by the proteins. It’s proteins that carry
oxygen around our bloodstream, that turn chips and burgers into
sugars and other nutrients that can be absorbed from our guts and
used to power our brains, that contract our muscles so we can turn
the pages of this book. But DNA is what carries the codes for all
these proteins.
If DNA is a code, then it must contain
symbols that can be read. It must act like a language. This is
indeed exactly what the DNA code does. It might seem odd when we
think how complicated we humans are, but our DNA is a language with
only four letters. These letters are known as bases, and their full
names are adenine, cytosine, guanine and thymine. They are
abbreviated to A, C, G and T. It’s worth remembering C, cytosine,
in particular, because this is the most important of all the bases
in epigenetics.
One of the easiest ways to visualise DNA
mentally is as a zip. It’s not a perfect analogy, but it will get
us started. Of course, one of the most obvious things that we know
about a zip is that it is formed of two strips facing each other.
This is also true of DNA. The four bases of DNA are the teeth on
the zip. The bases on each side of the zip can link up to each
other chemically and hold the zip together. Two bases facing each
other and joined up like this are known as a base-pair. The fabric
strips that the teeth are stitched on to on a zip are the DNA
backbones. There are always two backbones facing each other, like
the two sides of the zip, and DNA is therefore referred to as
double-stranded. The two sides of the zip are basically twisted
around to form a spiral structure – the famous double helix.
Figure 3.1 is a stylised representation of
what the DNA double helix looks like.
The analogy will only get us so far,
however, and that’s because the teeth of the DNA zip aren’t all
equivalent. If one of the teeth is an A base, it can only link up
with a T base on the opposite strand. Similarly, if there is a G
base on one strand, it can only link up with a C on the other one.
This is known as the base-pairing principle. If an A tried to link
with a C on the opposite strand it would throw the whole shape of
the DNA out of kilter, a bit like a faulty tooth on a
zip.
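For readers who like to see rules written out, the base-pairing principle can be sketched in a few lines of Python. This is only an illustration: the names `PAIRS`, `can_pair` and `opposite_strand` are our own inventions, and the sketch ignores the fact that the two real strands run in opposite chemical directions.

```python
# Base-pairing rules from the text: A links only with T, G links only with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def can_pair(base_one: str, base_two: str) -> bool:
    """Return True if the two bases can form a base-pair."""
    return PAIRS.get(base_one) == base_two


def opposite_strand(strand: str) -> str:
    """Work out, tooth by tooth, the strand that faces `strand` on the zip.

    Simplification (our assumption): strand direction is ignored.
    """
    return "".join(PAIRS[base] for base in strand)


print(can_pair("A", "T"))          # True
print(can_pair("A", "C"))          # False, the 'faulty tooth' case
print(opposite_strand("GATTACA"))  # CTAATGT
```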
Keeping it pure
The base-pairing principle is
incredibly important in terms of DNA function. During development,
and even during a lot of adult life, the cells of our bodies
divide. They do this so that organs can get bigger as a baby
matures, for example. They also divide to replace cells that die off
quite naturally. An example of this is the production by the bone
marrow of white blood cells, produced to replace those that are
lost in our bodies’ constant battles with infectious
micro-organisms. The majority of cell types reproduce by first
copying their entire DNA, and then dividing it equally between two
daughter cells. This DNA replication is essential. Without it,
daughter cells could end up with no DNA, which in most cases would
render them completely useless, like a computer that’s lost its
operating software.
It’s the copying of DNA before each cell
division that shows why the base-pairing principle is so important.
Hundreds of scientists have spent their entire careers working out
the details of how DNA gets faithfully copied. Here’s the gist of
it. The two strands of DNA are pulled apart and then the huge
number of proteins involved in the copying (known as the
replication complex) get to work.
Figure 3.2 shows
in principle what happens. The replication complex moves along each
single strand of DNA, and builds up a new strand facing it. The
complex recognises a specific base – base C for example – and
always puts a G in the opposite position on the strand that it’s
building. That’s why the base-pairing principle is so important.
Because C has to pair up with G, and A has to pair up with T, the
cells can use the existing DNA as a template to make the new
strands. Each daughter cell ends up with a new perfect copy of the
DNA, in which one of the strands came from the original DNA
molecule and the other was newly synthesised.
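The copying step just described can be mimicked in a toy model. The sketch below is not how the replication complex really works; it simply assumes a double helix can be written as a pair of strings, and the function names are hypothetical. It shows why each daughter ends up with one old strand and one new one.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def build_partner(template: str) -> str:
    """Build a new strand opposite `template` using the pairing rules."""
    return "".join(PAIRS[base] for base in template)


def replicate(double_helix):
    """Pull the two strands apart and rebuild a partner for each.

    `double_helix` is a (strand, facing_strand) tuple, a toy representation
    assumed for this example. Each daughter keeps one original strand.
    """
    old_one, old_two = double_helix
    daughter_one = (old_one, build_partner(old_one))   # old strand + new strand
    daughter_two = (build_partner(old_two), old_two)   # new strand + old strand
    return daughter_one, daughter_two


parent = ("ACGGT", "TGCCA")
first, second = replicate(parent)
print(first)   # ('ACGGT', 'TGCCA')
print(second)  # ('ACGGT', 'TGCCA')
```

Both daughters carry the same sequence as the parent, which is the whole point of using the existing DNA as a template.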
Even in nature, in a system which has
evolved over billions of years, nothing is perfect and occasionally
the replication machinery makes a mistake. It might try to insert a
T where a C should really go. When this happens the error is almost
always repaired very quickly by another set of proteins that can
recognise that this has happened, take out the wrong base and put
in the right one. This is the DNA repair machinery, and one of the
reasons it’s able to act is that when the wrong bases pair up,
it recognises that the DNA ‘zip’ isn’t done up
properly.
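The idea of spotting a badly done-up zip can also be sketched in code. This is a deliberately naive model: it assumes the repair step already knows which strand is the trusted template, and `find_mismatches` and `repair` are made-up names for illustration.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def find_mismatches(template: str, new_strand: str) -> list:
    """Return the positions where the new strand breaks the pairing rules."""
    return [
        position
        for position, (old, new) in enumerate(zip(template, new_strand))
        if PAIRS[old] != new
    ]


def repair(template: str, new_strand: str) -> str:
    """Take out each wrong base and put in the one the template demands."""
    fixed = list(new_strand)
    for position in find_mismatches(template, new_strand):
        fixed[position] = PAIRS[template[position]]
    return "".join(fixed)


template = "AGGCT"
faulty_copy = "TCTGA"  # position 2 holds a T where a C should go

print(find_mismatches(template, faulty_copy))  # [2]
print(repair(template, faulty_copy))           # TCCGA
```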
The cell puts a huge amount of energy
into keeping the DNA copies completely faithful to the original
template. This makes sense if we go back to our model of DNA as a
script. Consider one of the most famous lines in all of English
literature:
O Romeo, Romeo! wherefore art thou Romeo?
If we insert just one extra letter,
then no matter how well the line is delivered on stage, its effect
is unlikely to be the one intended by the Bard:
O Romeo, Romeo! wherefore fart thou Romeo?
This puerile example illustrates why a
script needs to be reproduced faithfully. It can be the same with
our DNA – one inappropriate change (a mutation) can have
devastating effects. This is particularly true if the mutation is
present in an egg or a sperm, as this can ultimately lead to the
birth of an individual in whom all the cells carry the mutation.
Some mutations have devastating clinical effects. These range from
children who age so prematurely that a ten-year-old has the body of
a person of 70, to women who are pretty much predestined to develop
aggressive and difficult-to-treat breast cancer before they are 40
years of age. Thankfully, these sorts of genetic mutations and
conditions are relatively rare compared with the types of diseases
that afflict most people.
The 50,000,000,000,000 or so cells in a
human body are all the result of perfect replication of DNA, time
after time after time, whenever cells divide after the formation of
that single-cell zygote from Chapter 1.
This is all the more impressive when we realise just how much DNA
has to be reproduced each time one cell divides to form two
daughter cells. Each cell contains six billion base-pairs of DNA
(half originally came from your father and half from your mother).
This sequence of six billion base-pairs is what we call the genome.
So every single cell division in the human body was the result of
copying 6,000,000,000 base-pairs of DNA. Using the same type of
calculation as in Chapter 1, if we count
one base-pair every second without stopping, it would take a mere
190 years to count all the base-pairs in the genome of a cell. When we
consider that a baby is born just nine months after the creation of
the single-celled zygote, we can see that our cells must be able to
replicate DNA really fast.
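The 190-year figure is easy to check. The short calculation below assumes a year of 365.25 days, which is our choice rather than anything stated in the text.

```python
BASE_PAIRS_PER_CELL = 6_000_000_000

# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25

# Counting one base-pair per second, without stopping.
years_to_count = BASE_PAIRS_PER_CELL / SECONDS_PER_YEAR
print(round(years_to_count))  # 190
```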
The three billion base-pairs we inherit
from each parent aren’t formed of one long string of DNA. They are
arranged into smaller bundles, which are the chromosomes. We’ll
delve deeper into these in Chapter
9.
Reading the script
Let’s go back to the more fundamental
question of what these six billion base-pairs of DNA actually do,
and how the script works. More specifically, how can a code that
only has four letters (A, C, G and T) create the thousands and
thousands of different proteins found in our cells? The answer is
surprisingly elegant. It could be described as the modular paradigm
of molecular biology but it’s probably far more useful to think of
it as Lego.
Lego used to have a great advertising
slogan ‘It’s a new toy every day’, and it was very accurate. A
large box of Lego contains a limited number of designs, essentially
a fairly small range of bricks of certain shapes, sizes and
colours. Yet it’s possible to use these bricks to create models of
everything from ducks to houses, and from planes to hippos.
Proteins are rather like that. The ‘bricks’ in proteins are quite
small molecules called amino acids, and there are twenty standard
amino acids (different Lego bricks) in our cells. But these twenty
amino acids can be joined together in an incredible array of
combinations, varying widely in order and length, to create an
enormous number of proteins.
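To get a feel for how enormous that number is, we can count the chains that twenty bricks allow. The sketch below only counts possible sequences of a given length; it makes no claim about how many of them exist in real cells, and the function name is our own.

```python
AMINO_ACIDS = 20  # the twenty standard amino acids


def possible_chains(length: int) -> int:
    """Number of distinct amino acid chains of exactly `length` bricks."""
    return AMINO_ACIDS ** length


print(possible_chains(3))              # 8000
print(len(str(possible_chains(100))))  # 131, i.e. a 131-digit number
```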
That still leaves the problem of how even
as few as twenty amino acids can be encoded by just four bases in
DNA. The way this works is that the cell machinery ‘reads’ DNA in
blocks of three base-pairs at a time. Each block of three is known
as a codon and may be AAA, or GCG or any other combination of A, C,
G and T. From just four bases it’s possible to create sixty-four
different codons, more than enough for the twenty amino acids. Some
amino acids are coded for by more than one codon. For example, the
amino acid called lysine is coded for by AAA and AAG. A few codons
don’t code for amino acids at all. Instead they act as signals to
tell the cellular machinery that it’s at the end of a
protein-coding sequence. These are referred to as stop
codons.
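The codon arithmetic can be checked by listing every possible block of three. In the sketch below the two lysine codons come from the text, while the three stop codons (TAA, TAG and TGA) are taken from the standard genetic code rather than from this chapter.

```python
from itertools import product

BASES = "ACGT"

# Every possible block of three bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
print(len(codons))  # 64

# From the text: lysine is coded for by more than one codon.
LYSINE_CODONS = {"AAA", "AAG"}

# From the standard genetic code (not named in the text): the stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Codons left over to code for the twenty amino acids.
coding_codons = [codon for codon in codons if codon not in STOP_CODONS]
print(len(coding_codons))  # 61
```

Sixty-one codons for twenty amino acids is why several codons can share the same amino acid.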
How exactly does the DNA in our
chromosomes act as a script for producing proteins? It does it
through an intermediary molecule called messenger RNA
(mRNA). mRNA is very like DNA although it does differ in a few
significant details. Its backbone is slightly different from DNA
(hence RNA, which stands for ribonucleic acid rather than
deoxyribonucleic acid); it is single-stranded (only one backbone);
and it replaces the T base with a very similar but slightly different
one called U (we don’t need to go into the reason it does this
here). When a particular DNA stretch is ‘read’ so that a protein
can be produced using that bit of script, a huge complex of
proteins unzips the right piece of DNA and makes mRNA copies. The
complex uses the base-pairing principle to make perfect mRNA
copies. The mRNA molecules are then used as temporary templates at
specialised structures in the cell that produce protein. These read
the three letter codon code and stitch together the right amino
acids to form the longer protein chains. There is of course a lot
more to it than all this, but that’s probably sufficient
detail.
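The two steps, DNA to mRNA and mRNA to protein, can be sketched as a toy program. Several things here are assumptions made for illustration: the DNA sequence is invented, only a handful of codons from the standard genetic code are included, and strand direction is ignored.

```python
# Pairing rules for copying DNA into mRNA: as in DNA, except that
# mRNA uses U wherever DNA would use T.
DNA_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# A tiny excerpt of the standard genetic code, written in mRNA letters.
CODON_TABLE = {
    "AUG": "Met",
    "AAA": "Lys",
    "AAG": "Lys",
    "UAA": "STOP",
    "UAG": "STOP",
    "UGA": "STOP",
}


def transcribe(dna_template: str) -> str:
    """Make an mRNA copy of a DNA strand using the pairing rules."""
    return "".join(DNA_TO_MRNA[base] for base in dna_template)


def translate(mrna: str) -> list:
    """Read the mRNA three letters at a time until a stop codon appears."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


dna_template = "TACTTTTTCATT"  # hypothetical stretch of DNA
mrna = transcribe(dna_template)
print(mrna)             # AUGAAAAAGUAA
print(translate(mrna))  # ['Met', 'Lys', 'Lys']
```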
An analogy from everyday life may be
useful here. The process of moving from DNA to mRNA to protein is a
bit like controlling an image from a digital photograph. Let’s say
we take a photograph on a digital camera of the most amazing thing
in the world. We want other people to have access to the image, but
we don’t want them to be able to change the original in any way.
The raw data file from the camera is like the DNA blueprint. We
copy it into another format that can’t be changed very much – a
PDF maybe – and then we email out thousands of copies of this PDF,
to everyone who asks for it. The PDF is the messenger RNA. If
people want to, they can print paper copies from this PDF, as many
as they want, and these paper copies are the proteins. So everyone
in the world can print the image, but there is only one original
file.
Why so complicated? Why not just have a
direct mechanism? There are a number of good reasons that evolution
has favoured this indirect method. One of them is to prevent damage
to the script, the original image file. When DNA is unzipped it is
relatively susceptible to damage and that’s something that cells
have evolved to avoid. The indirect way in which DNA codes for
proteins minimises the period of time for which a particular
stretch of DNA is open and vulnerable. Another reason this
indirect method has been favoured by evolution is that it allows a
lot of control over the amount of a specific protein that’s
produced, and this creates flexibility.
Consider the protein called alcohol
dehydrogenase (ADH). This is produced in the liver and breaks down
alcohol. If we drink a lot of alcohol, the cells of our livers will
increase the amounts of ADH they produce. If we don’t drink for a
while, the liver will produce less of this protein. This is one of
the reasons why people who drink frequently are better able to
tolerate the immediate effects of alcohol than those who rarely
drink, who will become tipsy very quickly on just a couple of
glasses of wine. The more often we drink alcohol, the more ADH
protein our livers produce (up to a limit). The cells of the liver
don’t do this by increasing the number of copies of the ADH
gene. They do this by reading the ADH gene more efficiently,
i.e. producing more mRNA copies and/or by using these mRNA copies
more efficiently as protein templates.
As we shall see, epigenetics is one of
the mechanisms a cell uses to control the amount of a particular
protein that is produced, especially by controlling how many mRNA
copies are made from the original template.
The last few paragraphs have all been
about how genes encode proteins. How many genes are there in our
cells? This seems like a simple question but oddly enough there is
no agreed figure on this. This is because scientists can’t agree on
how to define a gene. It used to be quite straightforward – a gene
was a stretch of DNA that encoded a protein. We now know that this
is far too simplistic. However, it’s certainly true to say that all
proteins are encoded by genes, even if not all genes encode
proteins. There are about 20,000 to 24,000 protein-encoding genes
in our DNA, a much lower estimate than the 100,000 that scientists
thought was a good guess just ten years ago¹.
Editing the script
Most genes in human cells have quite a
similar structure. There’s a region at the beginning called the
promoter, which binds the protein complexes that copy the DNA to
form mRNA. The protein complexes move along through what’s known as
the body of the gene, making a long mRNA strand, until they finally
fall off at the end of the gene.
Imagine a gene body that is 3,000
base-pairs long, a perfectly sensible length for a gene. The mRNA
will also be 3,000 bases long. Each amino acid is encoded by a
codon composed of three bases, so we would predict that this mRNA
will encode a protein that is 1,000 amino acids long. But, perhaps
unexpectedly, what we find is that the protein is usually
considerably shorter than this.
If the sequence of a gene is typed out it
looks like a long string of combinations of the letters A, C, G and
T. But if we analyse this with the right software, we find that we
can divide that long string into two types of sequences. The first
type is called an exon (for expressed region) and an exon
can code for a run of amino acids. The second type is called an
intron (for intragenic region). This doesn’t code for a
run of amino acids. Instead it contains lots of the ‘stop’ codons
that signal that the protein should come to an end.
When the mRNA is first copied from the
DNA it contains the whole run of exons and introns. Once this long
RNA molecule has been created, another multi-sub-unit protein
complex comes along. It removes all the intron sequences and then
joins up the exons to create an mRNA that codes for a continuous
run of amino acids. This editing process is called
splicing.
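Splicing can be pictured as a cut-and-join operation on a labelled string. In the sketch below the sequences are invented, and we assume the exon and intron boundaries are already known, which sidesteps the hard part of what the real splicing machinery does.

```python
# A hypothetical freshly copied RNA, written as labelled segments.
pre_mrna = [
    ("exon", "AUGAAA"),
    ("intron", "GUAAGU"),
    ("exon", "AAGGCU"),
    ("intron", "GUUAAG"),
    ("exon", "UAA"),
]


def splice(segments) -> str:
    """Remove every intron and join the remaining exons in order."""
    return "".join(sequence for kind, sequence in segments if kind == "exon")


unspliced_length = sum(len(sequence) for _, sequence in pre_mrna)
mature_mrna = splice(pre_mrna)

print(unspliced_length)  # 27
print(mature_mrna)       # AUGAAAAAGGCUUAA
print(len(mature_mrna))  # 15
```

The spliced message is shorter than the stretch of DNA it was copied from, which is why the protein is shorter than a simple count of bases would predict.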
This again seems extremely complicated,
but there’s a very good reason that this complex mechanism has been
favoured by evolution. It’s because it enables a cell to use a
relatively small number of genes to create a much bigger number of
proteins. The way this works is shown in Figure
3.3.
The initial mRNA contains all the exons
and all the introns. Then it’s spliced to remove the introns. But
during this splicing some of the exons may also be removed. Some
exons will be retained in the final mRNA, others will be skipped
over. The various proteins that this creates may have quite similar
functions, or they may differ dramatically. The cell can express
different proteins depending on what that cell has to do at a
particular time, or because of different signals that it receives.
If we define a gene as something that encodes a protein, this
mechanism means that just 20,000 or so genes can code for far more
than just 20,000 proteins.
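A rough count shows how quickly the options multiply. The sketch below assumes, purely for illustration, that any exon may be kept or skipped as long as the kept exons stay in their original order; real cells use only some of these combinations, and the exon labels are hypothetical.

```python
from itertools import combinations


def possible_transcripts(exons):
    """List every way of keeping at least one exon, preserving their order."""
    variants = []
    for how_many in range(1, len(exons) + 1):
        for kept in combinations(exons, how_many):
            variants.append("-".join(kept))
    return variants


variants = possible_transcripts(["E1", "E2", "E3"])
print(len(variants))  # 7
print(variants)
# ['E1', 'E2', 'E3', 'E1-E2', 'E1-E3', 'E2-E3', 'E1-E2-E3']
```

Even a gene with only three exons could, under this generous assumption, give rise to seven different messages.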
Whenever we describe the genome we talk
about it in very two-dimensional terms, almost like a railway
track. Peter Fraser’s laboratory at the Babraham Institute outside
Cambridge has published some extraordinary work showing it’s
probably nothing like this at all. He works on the genes that code
for the proteins required to make haemoglobin, the pigment in red
blood cells that carries oxygen all around the body. There are a
number of different proteins needed to create the final pigment,
and they lie on different chromosomes. Doctor Fraser has shown that
in cells that produce large amounts of haemoglobin, these
chromosome regions become floppy and loop out like tentacles
sticking out of the body of an octopus. These floppy regions mingle
together in a small area of the cell nucleus, waving about until
they can find each other. By doing this, there is an increased
chance that all the proteins needed to create the functional
haemoglobin pigment will be expressed together at the same
time².
Each cell in our body contains
6,000,000,000 base-pairs. About 120,000,000 of these code for
proteins. One hundred and twenty million sounds like a lot, but
it’s actually only 2 per cent of the total amount. So although we
think of proteins as being the most important things our cells
produce, about 98 per cent of our genome doesn’t code for
protein.
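The percentages follow directly from the two figures above, as this quick check shows (the variable names are ours).

```python
TOTAL_BASE_PAIRS = 6_000_000_000
PROTEIN_CODING_BASE_PAIRS = 120_000_000

coding_per_cent = PROTEIN_CODING_BASE_PAIRS * 100 / TOTAL_BASE_PAIRS
non_coding_per_cent = 100 - coding_per_cent

print(coding_per_cent)      # 2.0
print(non_coding_per_cent)  # 98.0
```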
Until recently, the reason that we have
so much DNA when so little of it leads to a protein was a complete
mystery. In the last ten years we’ve finally started to get a grip
on this, and once again it’s connected with regulating gene
expression through epigenetic mechanisms. It’s now time to move on
to the molecular biology of epigenetics.