MATH/COSC 3570 Introduction to Data Science
Mar 29, 2022
Mar 22, 2023
Mar 18, 2024
Probability is the study of chance, the language of uncertainty.
We could do data science without any probability involved. However, what we can learn from data would be much more limited. Why?
Every time you collect a data set, you obtain a different one.
Your data are affected by chance, or random noise!
Knowledge of probability is essential for data science, especially when we want to quantify uncertainty about what we learn from our data.
The probability that some outcome of a process will be obtained is the relative frequency with which that outcome would be obtained if the process were repeated a large number of times independently under similar conditions.
Example: flipping a coin 10 times vs. 1000 times.

| Outcome | Frequency | Relative frequency |
|---|---|---|
| Heads | 6 | 0.6 |
| Tails | 4 | 0.4 |
| Total | 10 | 1.0 |

| Outcome | Frequency | Relative frequency |
|---|---|---|
| Heads | 535 | 0.535 |
| Tails | 465 | 0.465 |
| Total | 1000 | 1.000 |
Without peeking at the bag, how do we approximate the probability of getting a red ball?
Monte Carlo Simulation: Repeat drawing a ball at random a large number of times to approximate the probability by the relative frequency of getting a red ball.
So how many red balls in the bag?
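A sketch of this Monte Carlo idea. The bag's contents below are a made-up assumption (in the activity they are unknown); the point is that the relative frequency over many draws approximates the true probability.

```r
set.seed(3570)
# Hypothetical bag: 2 red and 3 white balls, so the true P(red) = 0.4
bag <- c("red", "red", "white", "white", "white")

# Draw one ball at random, with replacement, a large number of times
draws <- sample(bag, size = 10000, replace = TRUE)

# Relative frequency of red approximates P(red)
mean(draws == "red")
```

The more draws we make, the closer the relative frequency tends to get to the true probability.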
set.seed(): sets the seed of R's random number generator so that simulation results are reproducible.
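For instance, re-setting the same seed reproduces exactly the same random draws:

```r
set.seed(2024)
a <- rnorm(3)   # three random draws

set.seed(2024)  # reset to the same seed
b <- rnorm(3)   # the same three draws again

identical(a, b)  # TRUE
```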
rnorm(n, mean, sd): draw \(n\) observations from a normal distribution with mean mean and standard deviation sd.
dnorm(x, mean, sd): compute the density value \(f(x)\) (NOT a probability).
pnorm(q, mean, sd): compute \(P(X \leq q)\).
pnorm(q, mean, sd, lower.tail = FALSE): compute \(P(X > q)\).
pnorm(q2, mean, sd) - pnorm(q1, mean, sd): compute \(P(q_1 \leq X \leq q_2)\).
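For example, with \(X \sim N(120, 5^2)\) (the same distribution used in the confidence-interval simulation):

```r
# P(X <= 130)
pnorm(130, mean = 120, sd = 5)                       # about 0.977

# P(X > 130)
pnorm(130, mean = 120, sd = 5, lower.tail = FALSE)   # about 0.023

# P(115 <= X <= 125)
pnorm(125, 120, 5) - pnorm(115, 120, 5)              # about 0.683

# Density value at the mean -- NOT a probability
dnorm(120, mean = 120, sd = 5)                       # about 0.0798
```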
18-probability
In lab.qmd, in the ## Lab 18 section:
1. To use ggplot, first compute the binomial probabilities with dbinom(x, size = ___, prob = ___).

# A tibble: 6 × 2
x y
<int> <dbl>
1 0 0.168
2 1 0.360
3 2 0.309
4 3 0.132
5 4 0.0284
6 5 0.00243
2. Add geom_col()
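A sketch of the two steps together. The values size = 5 and prob = 0.3 are inferred from the tibble printed above, not given in the lab, and assume ggplot2 is installed:

```r
library(ggplot2)  # assumes ggplot2 is available

# Step 1: binomial probabilities P(X = x) for x = 0, ..., 5
# (size = 5, prob = 0.3 reproduce the tibble shown above)
d <- data.frame(x = 0:5, y = dbinom(0:5, size = 5, prob = 0.3))

# Step 2: draw one column per value of x
ggplot(d, aes(x = x, y = y)) +
  geom_col()
```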
Population (Data generating process): a group of subjects we are interested in studying
Sample (Data): a (representative) subset of our population of interest
Parameter: an unknown, fixed numerical quantity derived from the population
Statistic: a numerical quantity derived from a sample
Common population parameters of interest and their corresponding sample statistic:
| Quantity | Parameter | Statistic (Point estimate) |
|---|---|---|
| Mean | \(\mu\) | \(\overline{x}\) |
| Variance | \(\sigma^2\) | \(s^2\) |
| Standard deviation | \(\sigma\) | \(s\) |
| Proportion | \(p\) | \(\hat{p}\) |
What if \(\sigma\) is unknown? Estimate it with the sample standard deviation \(s\) and use the \(t\) distribution (with \(n - 1\) degrees of freedom) in place of the normal.
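A sketch of a t-based interval in this situation. The data here are simulated purely for illustration (in practice \(\sigma\) would be unknown, which is exactly why we use \(s\) and qt()):

```r
set.seed(2024)
x <- rnorm(16, mean = 120, sd = 5)  # a single sample of size 16
n <- length(x)

# Margin of error with sigma unknown: t quantile and sample sd
E <- qt(1 - 0.05 / 2, df = n - 1) * sd(x) / sqrt(n)

# 95% confidence interval for mu
c(mean(x) - E, mean(x) + E)
```

The built-in t.test(x)$conf.int returns the same interval.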
If we were able to collect our sample data many times and build the corresponding confidence intervals, we would expect about 95% of those intervals to contain the true population parameter.
However,
We never know if in fact 95% of them do, or whether any particular interval contains the true parameter! 😱
❌ Cannot say “There is a 95% chance/probability that the true parameter is in the confidence interval.”
In practice we may only be able to collect one single data set.
\(X_1, \dots, X_n \sim N(\mu, \sigma^2)\) where \(\mu = 120\) and \(\sigma = 5\).
Algorithm
mu <- 120; sig <- 5                       # true mean and standard deviation
al <- 0.05; M <- 100; n <- 16             # significance level, number of samples, sample size
set.seed(2024)
x_rep <- replicate(M, rnorm(n, mu, sig))  # n-by-M matrix: M samples of size n
xbar_rep <- apply(x_rep, 2, mean)         # sample mean of each sample
E <- qnorm(p = 1 - al / 2) * sig / sqrt(n)  # margin of error (sigma known)
ci_lwr <- xbar_rep - E
ci_upr <- xbar_rep + E
plot(NULL, xlim = range(c(ci_lwr, ci_upr)), ylim = c(0, M),
     xlab = "95% CI", ylab = "Sample", las = 1)
mu_out <- (mu < ci_lwr | mu > ci_upr)     # TRUE for intervals that miss mu
segments(x0 = ci_lwr, y0 = 1:M, x1 = ci_upr, col = "navy", lwd = 2)
segments(x0 = ci_lwr[mu_out], y0 = (1:M)[mu_out],
         x1 = ci_upr[mu_out], col = 2, lwd = 2)  # intervals missing mu drawn in red
abline(v = mu, col = "#FFCC00", lwd = 2)  # vertical line at the true mean
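To check the coverage numerically, we can count how many of the M intervals miss \(\mu\); the key lines of the simulation are restated here so the snippet runs on its own:

```r
set.seed(2024)
mu <- 120; sig <- 5; al <- 0.05; M <- 100; n <- 16
x_rep <- replicate(M, rnorm(n, mu, sig))
xbar_rep <- apply(x_rep, 2, mean)
E <- qnorm(1 - al / 2) * sig / sqrt(n)
mu_out <- (mu < xbar_rep - E | mu > xbar_rep + E)

mean(mu_out)  # proportion of intervals that miss mu; should be close to al = 0.05
```

With only M = 100 repetitions the observed proportion fluctuates around 0.05; it approaches 0.05 as M grows.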
random.Generator.choice(): generates a random sample from a given array (NumPy's Python counterpart of R's sample()).