May 29, 2020

Suppose everyone in the United States wanted to join the same video call? If each of the 330 million participants joins and drops at a random time, how likely is it that at least one person overlaps with everyone else? This problem easily exceeds the limits of a practical simulation, but we'll write some code to develop intuition about the results before attempting to solve it analytically. Here's the full problem text from this week's Riddler.

One Friday morning, suppose everyone in the U.S. (about 330 million people) joins a single Zoom meeting between 8 a.m. and 9 a.m. — to discuss the latest Riddler column, of course. This being a virtual meeting, many people will join late and leave early.In fact, the attendees all follow the same steps in determining when to join and leave the meeting. Each person independently picks two random times between 8 a.m. and 9 a.m. — not rounded to the nearest minute, mind you, but any time within that range. They then join the meeting at the earlier time and leave the meeting at the later time.

What is the probability that at least one attendee is on the call with everyone else (i.e., the attendee’s time on the call overlaps with every other person’s time on the call)?

Extra credit:What is the probability that at least two attendees are on the call with everyone else?

This is a challenging problem, in part because it exceeds the limits of a simulation that could run in reasonable time, and also because the underlying probability distributions are a little tricky. I'm still working on an analytical solution.

The following image helped me think about the probability distributions involved. Each blue bar represents one participant's time on the call, and we quickly build a stack of participants joining and dropping over time. However, we are especially interested in two participants: the person that *leaves the call first*, and the person that *arrives to the call last*.

These two people set the boundary on the minimum overlap time. In order to overlap with everyone, at least one person must *join* the call before the first person leaves, and must *stay* on the call until the last person joins. If those two conditions are met, then we're guaranteed to have at least one overlap.

While it's impractical to simulate the entire 330-million-person call, I found that simulations with smaller values (100,000 to 50 million or more) consistently yielded values around 66% for at least one overlap, and around 40% for at least two overlaps.

I also had some fun this week playing with dask to run large simulations on my machine. Dask lets you create a "local cluster" on your computer that can parallelize certain operations and run them much faster than you could on a single thread.

The code below creates a dask local cluster, then runs a large batch of simulations that align with the analytical results above.

```
from dask import array as da
from dask.distributed import Client
# start a local cluster using all CPU threads
client = Client()
def model(trials: int, participants: int):
"""
Returns the number of participants on a call who overlap
with every other participant. This is solved efficiently
using dask arrays for parallel simulations:
1. Generate each person's start and end times. The start
time is the min and the end time is the max.
2. For a group of participants, find the latest start
time and the earliest end time.
3. Identify any participants that have a starting time
before than the earliest end time and an end time
after the latest start time.
Returns
-------
result : an array of the number of participants in each
trial who overlap every other participant.
"""
times = da.random.uniform(size=(trials, participants, 2))
starts = times.min(axis=2)
ends = times.max(axis=2)
latest_start = starts.max(axis=1)
earliest_end = ends.min(axis=1)
# check if any participants arrive before the first person
# leaves and leave after the last person arrives.
cond1 = starts.T < earliest_end
cond2 = ends.T > latest_start
return (cond1 & cond2).sum(axis=0)
# results here is a delayed object. Call `compute() for values
# here we run 1000 trials with 10 million participants
results = model(trials=1000, participants=10000000)
results.compute()
```