---
jupytext:
  formats: ipynb,md:myst,py:light
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.13.8
kernelspec:
  display_name: Python 3 (math-583)
  language: python
  name: math-583
---

```{code-cell}
:tags: [hide-cell]

import mmf_setup; mmf_setup.nbinit()
import os
from pathlib import Path
FIG_DIR = Path(mmf_setup.ROOT) / '../Docs/_build/figures/'
os.makedirs(FIG_DIR, exist_ok=True)
import logging; logging.getLogger("matplotlib").setLevel(logging.CRITICAL)
%matplotlib inline
import numpy as np, matplotlib.pyplot as plt
try: from myst_nb import glue
except: glue = None
```

(sec:Darwin)=
# Galton and Darwin

In **More Food for Fair Thought** in Chapter 6 of {cite}`Schroeder:1991`, Schroeder
discusses an experiment in which Charles Darwin was trying to see if there was a
difference between the growth of self-fertilized and cross-fertilized plants.  He
provided his half-cousin, Francis Galton, 30 measurements of the heights of plants, 15
of each type.  Galton performed a rank-order comparison, concluding: *"We
may...confidently affirm that a crossed series will always be found to exceed a
self-fertilised series, within the range of the conditions under which the present
experiment has been made."*

Discuss the validity of Galton's claim.  Support your answer by proposing a model and
simulating with a random number generator.  Compare your results with the bootstrap
method mentioned by Schroeder.

:::{margin}
![Zea mays](https://upload.wikimedia.org/wikipedia/commons/e/e3/Zea_mays_-_K%C3%B6hler%E2%80%93s_Medizinal-Pflanzen-283.jpg)
:::
## The Growth of Zea Mays

The growth is for a species of corn ([Zea
mays](https://en.wikipedia.org/wiki/Maize)). The following table is from [Darwin's 
Notes](http://darwin-online.org.uk/converted/published/1876_CrossandSelfFertilisation_F1249/1876_CrossandSelfFertilisation_F1249.html).
Heights are in inches.


| **Pot 1:** Crossed | Self-fert. | **Pot 2**  Crossed | Self-fert. | **Pot 3** Crossed | Self-fert. | **Pot 4** Crossed | Self-fert. |
|---------|------------|---------|------------|---------|------------|---------|------------|
| 23 4/8  | 17 3/8     | 22      | 20         | 22 1/8  | 18 5/8     | 21      | 18         |
| 12      | 20 3/8     | 19 1/8  | 18 3/8     | 20 1/8  | 15 2/8     | 22 1/8  | 12 6/8     |
| 21      | 20         | 21 4/8  | 18 5/8     | 18 2/8  | 16 4/8     | 23      | 15 4/8     |
|         |            |         |            | 21 5/8  | 18         | 12      | 18         |
|         |            |         |            | 23 2/8  | 16 2/8     |         |            |

> I have examined the measurements of the plants with care, and by many statistical
> methods, to find out how far the means of the several sets represent constant
> realities, such as would come out the same so long as the general conditions of growth
> remained unaltered. The principal methods that were adopted are easily explained by
> selecting one of the shorter series of plants, say of Zea mays, for an example."
>
> The observations as I received them are shown in columns II. and III., where they
> certainly have no primâ facie appearance of regularity. But as soon as we arrange them
> the in order of their magnitudes, as in columns IV. and V., the case is materially
> altered. We now see, with few exceptions, that the largest plant on the crossed side
> in each pot exceeds the largest plant on the self-fertilised side, that the second
> exceeds the second, the third the third, and so on. Out of the fifteen cases in the
> table, there are only two exceptions to this rule. We may therefore confidently affirm
> that a crossed series will always be found to exceed a self-fertilised series, within
> the range of the conditions under which the present experiment has been made.

Here is a rudimentary analysis of the data:  I plot the heights, then compute the mean
and standard deviation (assuming a normal distribution, shown on the right axis).

```{code-cell}
h = height_inches = np.array(
    [[23+4/8, 17+3/8],
     [12, 20+3/8],
     [21, 20],
     [22, 20],
     [19+1/8, 18+3/8],
     [21 + 4/8, 18 + 5/8],
     [22 + 1/8, 18 + 5/8],
     [20 + 3/8, 15 + 2/8],
     [18 + 2/8, 16 + 4/8],
     [21 + 5/8, 18],
     [23 + 2/8, 16 + 2/8],
     [21, 18],
     [22 + 1/8, 12 + 6/8],
     [23, 15 + 4/8],
     [12, 18]])

import scipy as sp

def plot(data, extend=5):
    N = len(data)
    fig, axs = plt.subplots(
        1, 2, sharey=True, 
        gridspec_kw=dict(width_ratios=(1, 0.2), 
                         wspace=0.01))


    ax = axs[0]
    ax.bar(np.arange(N)-0.2, data[:, 0], width=0.4, label="Cross fertilized");
    ax.bar(np.arange(N)+0.2, data[:, 1], width=0.4, label="Self fertilized");
    ax.set(ylabel="Height (inches)")
    ax.legend(loc='upper left')

    ax = axs[1]
    ax.set(frame_on=False)
    ax.hist(data, orientation='horizontal', density=True, alpha=0.3)
    ys = np.linspace(data.min()-extend, data.max()+extend)
    dists = [sp.stats.norm(loc=m, scale=s) 
             for m, s in zip(data.mean(axis=0), data.std(axis=0))]
    for n, dist in enumerate(dists):
        ax.plot(dist.pdf(ys), ys, c=f"C{n}", lw=2)

plot(height_inches)
plt.suptitle("Darwin's data")

from uncertainties import ufloat
H = [ufloat(m, s) for m, s in zip(h.mean(axis=0), h.std(axis=0))]
print(f"Crossed: h={H[0]:s}in, Self-fertilized: h={H[1]:s}in")
```

Galton sorted the data (rank order) and then compared each rank-ordered pair.

```{code-cell}
# Rank order all the data
inds = np.argsort(h, axis=0)
sorted_height_inches = np.concatenate([[h[i]] for i, h in zip(inds.T, h.T)]).T
plot(sorted_height_inches)
```

After seeing that 13 of the cross-fertilized plants were taller after rank ordering,
Galton *"confidently affirm[ed] that a crossed series will always be found to exceed a
self-fertilised series"* for the conditions realized in Darwin's experiment?  Are you convinced?

```{code-cell}
:tags: [hide-cell]

# I am not convinced...
rng = np.random.default_rng(seed=2)
N = 15
h = 20 + 4 * rng.normal(size=(N, 2))
inds = np.argsort(h, axis=0)
sorted_h = np.concatenate([[h[i]] for i, h in zip(inds.T, h.T)]).T
plot(sorted_h)
plt.suptitle("Random data");
```

```{code-cell}
:tags: [hide-cell]

rng = np.random.default_rng(seed=2)
N = 15
h = 20 + 4 * rng.normal(size=(N, 2))
inds = np.argsort(h, axis=0)
sorted_h = np.concatenate([[h[i]] for i, h in zip(inds.T, h.T)]).T
plot(sorted_h)
plt.suptitle("Random data");
```

## References

* https://arxiv.org/abs/2102.02572
