My first blog post was an introduction to heritability.
At the end of the post I promised I would do another post on exactly how to estimate heritability. This post is the answer to that promise.
This is the post I wish I had found when I first googled “How to do a Twin Study” a few years ago.
A question that plagued me during my honours year was ‘How do I actually estimate heritability?’
Of course I knew the answer…
By using a twin study, or some other more modern and sophisticated technique.
I even knew that heritability was “the proportion of variance attributable to genetic effects”.
I also knew that it was “twice the difference between the phenotypic correlation of identical and fraternal twins”.
I even ran the calculations with highly sophisticated computing software.
I understood heritability and I could calculate it using a program.
But I just didn’t know the basic nuts and bolts of how to get it, as in, I couldn’t have taken someone through a simple dummy example of how to estimate heritability.
Now I can.
So for this post, we’ll be going through step by step, as simply and clearly as possible, how to do a ‘classic’ twin study. By the end of this post, you’ll not only understand the basic idea of estimating heritability, but also know how to confidently estimate it using a sample of twins.
Next time you look at a classical twin study estimating heritability (as we all do at some point, right?), you’ll have a pretty good idea of how they got it, and you’ll even be able to critically assess the study yourself… sound good?
FYI: this is an old methodology that has been almost totally superseded since the 1970s by more complicated, but much more useful, twin study methods like Structural Equation Models. But the classic twin study is a nice simple way to better understand heritability, and you can do a small example yourself with no more than a pen and notepad… and probably a calculator.
How does the twin study work?
The classical twin study is one of the oldest and simplest methods for trying to understand how much our genes influence how different we are (heritability), and how much the environment influences how different we are.
The big advantage of twin studies is that they don’t require any actual genetic data. All you need is a reasonably large sample of identical and fraternal (non-identical) twins, a measurement of whatever trait you’re interested in, and some quite sensible and reasonably valid assumptions.
Identical twins are also known as ‘monozygotic twins’ or ‘MZ twins’, because they started as one fertilized egg, which split into two inside their mother’s womb. This is important because it means that they have exactly the same DNA, 100% identical (mostly!).
Fraternal (non-identical) twins are also known as ‘dizygotic twins’ or ‘DZ twins’, because they started as two fertilized eggs, which both somehow managed to implant in their mother’s womb at the same time and grow together. This means they have whatever DNA came packaged into each sperm cell, and each egg, which will not be identical. On average DZ twins share about 50% of their DNA, just like ordinary brothers and sisters.
In fact, the only practical difference between dizygotic twins, and ordinary brothers and sisters, is that they grew in their mother’s womb at the same time, rather than one after the other. This turns out to be very important to twin studies.
So MZ twins’ DNA is 100% the same, and DZ twins’ DNA is ~50% the same. This is the key to the twin study.
The fundamental assumption of the twin study is that both MZ and DZ twins pretty much share their environment equally, especially before they’re born. What I mean is that a pair of twins, whether DZ or MZ, grows together in the womb, is born together and is then raised together. Therefore any real differences in the environment for a given pair of twins are considered reasonably trivial, especially during their childhood.
What this means is, if we assume that the shared environment between MZ and DZ twins is roughly equal, then the only real influence on how different a pair of twins are from each other (their ‘phenotypic variation’) is their genetic differences (their ‘genetic variation’) and ‘unshared’ environmental variation.
So if we go back to our understanding of heritability there are, more or less, two sources of variation for individuals: genes and their environment.
So we can summarize the influence of genes and environment on individual differences with this super complicated mathematical equation:
P = G + E
So the P is short for ‘phenotype’ which is a fancy word for what makes you, you. Another word is ‘trait’ or ‘characteristic’. Your phenotype can be whatever you decide to measure like height, weight, hair colour, the presence or absence of a given disease, anything you can think of.
Hopefully it’s obvious that G stands for ‘Genetic’ and E stands for ‘Environment’.
More importantly the P, G and E are actually measurements of the ‘variation’ of each one. So ‘P’ is a measure of the total amount of variation between a sample of twins (this is the thing we will measure when we do our twin study).
In reality we can’t measure every single aspect of every individual’s environment, and it would be completely unethical even if we could (Mark Zuckerberg disagrees).
This is why our assumption is so crucial.
Because we’re assuming that all twin pairs have about the same environmental influence, we can estimate G.
Note: I could (and probably should) get a little more complicated here and split these two up into their sub-categories, but I won’t. For example, the genetic variation should really be split into three subtypes: additive genetic variation, dominance variation and epistatic variation… you either know what those things are or you don’t. Either way, we’re just going to lump them all together in our simple toy example and treat them as one thing.
So where are we?
Oh yes, variation.
So because E is equal for MZ and DZ twins, we can use our knowledge of the genetics of MZ and DZ twins, and our measurements of their phenotype, to estimate how much genetics influences their phenotypic variation.
Time to get serious.
How to do a Twin Study
So getting back to our equation.
P = G + E
How do we figure out the value of P, G and E?
Well, P is easy: we can measure it.
Note: I use the word ‘easy’ extremely tentatively. By easy I mean in theory it’s something we have access to; we can take measurements, and that’s our phenotype. In reality collecting and cataloguing good, accurate, useful measurements of any kind is one of the fundamental challenges of statistics and science in general and is one of the central considerations for experimental design. Variation and human error are some of the primary reasons that statistics are necessary in the first place.
As I’ve already mentioned, E is practically impossible to measure.
And G is what this is all about.
Twin studies are designed to allow us to estimate G by looking at how different a group of MZ twins is from a group of DZ twins.
So now we have two equations. One for MZs and one for DZs (hopefully the notation is obvious):
Pmz = Gmz + Emz
Pdz = Gdz + Edz
We’re about to do some seriously clever shit. Ready to jump down the rabbit hole?
This is where everything comes together. We don’t know what Gmz, Gdz, Emz or Edz are. But what we do know is that MZ twins are 100% genetically identical and DZ twins are ~50% identical.
So we can make the equations look like this:
Pmz = G + Emz
Pdz = G/2 + Edz
Are you with me? So Gdz is approximately half of Gmz.
Now remember, we’re also assuming that the shared environment of MZ twins and DZ twins is roughly equal. So:
Emz = Edz, so we can just call it E.
Without this assumption, the twin study won’t work, as we’ll see.
So what are we trying to do? We want to know what G is. The twin study does this by using the difference between Pmz and Pdz combined with all of the other information we just discussed.
Now, if you’re a little rusty on high school algebra then you’re just going to have to trust me that this next bit works, and not freak out too much. But here’s what we can do with our above equations. First let’s join them together by taking the bottom equation away from the top equation.
Pmz – Pdz = G + E – (G/2 + E)
If we expand out the bracket we have to change the + sign inside it to a – sign, so it looks like this:
Pmz – Pdz = G + E – G/2 – E
Let’s rearrange it a bit to make the next step more obvious:
Pmz – Pdz = G – G/2 + E – E
This is where our shared environment assumption is key. As you can see, it’s silly to both add E, and then take it away. Because we’re assuming Emz = Edz, we can simply cancel it out.
So we’re left with:
Pmz – Pdz = G – G/2
What’s G minus half of G? It’s just half of G, right? So now we have:
Pmz – Pdz = G/2
Another clever algebra trick and we’ve made it… Multiply both sides by two:
2(Pmz – Pdz) = G
Boom! We have a formula for G.
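If you’d rather check that algebra with actual numbers than take my word for it, here’s a minimal Python sketch. The function name estimate_g and the values chosen for G and E are made up purely for illustration; they don’t come from any real twin sample.

```python
def estimate_g(p_mz: float, p_dz: float) -> float:
    """Return G = 2 * (Pmz - Pdz), the formula we just derived."""
    return 2 * (p_mz - p_dz)


# Made-up 'true' values, chosen only so we can check the algebra.
G_true = 0.5      # hypothetical genetic contribution
E_shared = 0.25   # hypothetical shared environment (assumed equal for MZ and DZ)

p_mz = G_true + E_shared       # Pmz = G + E
p_dz = G_true / 2 + E_shared   # Pdz = G/2 + E

g_estimate = estimate_g(p_mz, p_dz)
print(g_estimate)  # prints 0.5, the G we started with
```

Notice that E never appears inside estimate_g. That only works because we assumed Emz = Edz; give the two groups different E values and the number that comes out is no longer G.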
Now we have to figure out (or estimate as best we can) the values of Pmz and Pdz.
First we have to understand what Pmz and Pdz actually are.
OK, so we know that they are some measure of ‘phenotype’.
In the real world, they are the average amount of similarity between each pair of twins in a large sample. So how physically similar is a twin to its sibling, and what’s the average similarity for the whole group? We have to find this average for MZ twins, and for DZ twins.
Statistically, they represent the phenotypic correlations for MZ and DZ twins respectively.
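As a rough sketch of what a ‘phenotypic correlation’ looks like in practice, here’s one way to compute it in Python. Everything here is an assumption for illustration: the heights are invented, the helper names pearson_r and twin_correlation are mine, and I’m using a plain Pearson correlation for simplicity (real twin analyses often use an intraclass correlation instead, because which twin counts as ‘first’ in a pair is arbitrary).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def twin_correlation(pairs):
    """Correlate the first twin's measurement with the second twin's, across pairs."""
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    return pearson_r(first, second)


# Invented heights in cm, one (twin A, twin B) tuple per pair. Not real data.
mz_pairs = [(170, 171), (160, 162), (180, 179), (175, 174), (165, 166)]
dz_pairs = [(170, 175), (160, 158), (180, 172), (175, 178), (165, 169)]

p_mz = twin_correlation(mz_pairs)  # plays the role of Pmz
p_dz = twin_correlation(dz_pairs)  # plays the role of Pdz

print(p_mz, p_dz)
print(2 * (p_mz - p_dz))  # our estimate of G for this toy data
```

Five pairs per group is far too few to say anything meaningful; the point is only to show where the numbers for Pmz and Pdz come from.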
That will be way clearer after a nice big example, which is what the next blog post will be all about.