A colleague writes (extract):

I have an assessment with 4 sections in it A,B,C and D.

I have a question bank for each section. The number of questions in each bank is A-10, B-10, C-5, D-5.

In my assessment I will print out randomly a fixed number of questions from each bank. Section A will have 5 questions, B-5, C-2, D-2. 14 questions in total appear on the exam.

I can figure out how many different exam papers (order doesn’t matter) can be generated (I think!).

\displaystyle \binom{10}{5}\cdot \binom{10}{5}\cdot \binom{5}{2}\cdot \binom{5}{2}=6350400

But my question is: what is the uniqueness of each exam, or what overlap between exams can be expected.?

I am not trying to get unique exams for everyone (unique as in no identical questions) but would kinda like to know what is the overlap.

Following the same argument as here we can establish that:

Fact 1

The expected number of students to share an exam is \approx 0.00006.

Let the number of exams \alpha:=6350400.

This is an approach that takes advantage of the fact that expectation is linear, and the probability of an event E not happening is

\displaystyle\mathbb{P}[\text{not-}E]=1-\mathbb{P}[E].

Label the 20 students by i=1,\dots,20 and define a random variable S_i by

\displaystyle S_i=\left\{\begin{array}{cc}1&\text{ if student i has the same exam as someone elese} \\ 0 & \text{ if student i has a unique exam}\end{array}\right.

Then X, the number of students who share an exam, is given by:

\displaystyle X=S_1+S_2+\cdots+S_{20},

and we can calculate, using the linearity of expectation.

\mathbb{E}[X]=\mathbb{E}[S_1]+\cdots \mathbb{E}[S_{20}].

The S_i are not independent but the linearity of expectation holds even when the addend random variables are not independent… and each of the S_i has the same expectation. Let p be the probability that student i does not share an exam with anyone else; then

\displaystyle\mathbb{E}[S_i]=0\times\mathbb{P}[S_i=0]+1\times \mathbb{P}[S_i=1],

but \displaystyle\mathbb{P}[S_i=0]=\mathbb{P}[\text{ student i does not share an exam}]=p, and

\displaystyle \mathbb{P}[S_i=1]=\mathbb{P}[\text{not-}(S_i=0)]=1-\mathbb{P}[S_i=0]=1-p,

and so

\displaystyle\mathbb{E}[S_i]=1-p.

All of the 20 S_i have this same expectation and so

\displaystyle\mathbb{E}[X]=20\cdot (1-p).

Now, what is the probability that nobody shares student i‘s exam?

We need students 1\rightarrow i-1 and i+1\rightarrow 20 — 19 students — to have different exams to student i, and for each there is \alpha-1 ways of this happening, and we do have independence here (student 1 not sharing student i‘s exam doesn’t change the probability of student 2 not sharing student i‘s exam), and so \mathbb{P}[\text{(student j not sharing) AND (student k not sharing)}] is the product of the probabilities.

So we have that

\displaystyle p=\left(\frac{\alpha-1}{\alpha}\right)^{19},

and so the answer to the question is:

\displaystyle\mathbb{E}[X]=20\cdot \left(1-\left(\frac{\alpha-1}{\alpha}\right)^{19}\right)\approx 0.00005985\approx 0.00006.

We can get an estimate for the probability that two or more students share an exam using Markov’s Inequality:

\displaystyle\mathbb{P}[X\geq 2]\leq \frac{\mathbb{E}[X]}{2}\approx 0.00003=0.003\%

Fact 2

This estimate is tight: the probability that two or more students (out of 20) share an exam is about 0.003%.

This tallies very well with the exact probability which can be found using a standard Birthday Problem argument (see the solution to Q. 7 here) to be:

\mathbb{P}[X\geq 2]\approx 0.0000299191\approx 0.003\%

The probability that two given students share an exam is 1/\alpha\approx 0.00001575\%

Fact 3

The expected number of shared questions between two students is 6.6

Take students 1 and 2. The questions are in four bins: two of ten, two of five. Let B_i be the number of questions in bin i that students 1 and 2 share. The expected number of shared questions, Q, is:

\displaystyle \mathbb{E}[Q]=\sum_{i=1}^4\mathbb{E}[B_i],

and the numbers are small enough to calculate the probabilities exactly using the hypergeometric distribution.

The calculations for bins 1 and 2, and bins 3 and 4 are the same. The expectation

\displaystyle\mathbb{E}[B_1]=\sum_{j=0}^5j\mathbb{P}[B_1=j].

Writing briefly p_j=\mathbb{P}[B_1=j], looking at the referenced hypergeometric distribution we find:

\displaystyle p_j=\frac{\binom{5}{j}\binom{5}{5-j}}{\binom{10}{5}}

and we find:

\displaystyle\mathbb{E}[B_1]=\mathbb{E}[B_2]=\frac52

Similarly we see that

\displaystyle\mathbb{E}[B_3]=\mathbb{E}[B_4]=\frac{4}{5}

and so, using linearity:

\displaystyle\mathbb{E}[Q]=\frac52+\frac52+\frac45+\frac45=6.6

This suggests that on average students share about 50% of the question paper. Markov’s Inequality gives:

\displaystyle\mathbb{P}[Q\geq 7]\underset{\approx}{\leq} 0.9429,

but I do not believe this is tight.

Calculating this probability exactly is tricky because there are many different ways that students can share a certain number of questions. We would be looking at something like “multiple hypergeometric”, and I would calculate it as the event not-(0 or 1 or 2 or 3 or 4 or 5 or 6).

I think the \mathbb{E}[Q]=6.6 result is striking enough at this time!