Homework 3: Probability, Statistics and Machine Learning
Spring 2024 MATH/COSC 3570 Introduction to Data Science by Dr. Cheng-Han Yu
- Note: For any simulation or random sampling, set the random seed to your student ID number, for example set.seed(6145678).
1 Probability and Statistics
1.1 Monte Carlo Simulation
- Suppose you are in a classroom with 30 people. If we assume this is a randomly selected group of 30 people, what is the chance that at least two people have the same birthday? Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29.
- Note that birthdays can be represented as numbers between 1 and 365, so a sample of 30 birthdays can be obtained like this:
n <- 30
bdays <- sample(x = 1:365, size = n, replace = TRUE)
- To check whether this particular set of 30 people includes at least two with the same birthday, we can use the function duplicated(), which returns TRUE whenever an element of a vector is a duplicate. Here is an example:
duplicated(c(1, 2, 3, 1, 4, 3, 5))
[1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
The second time 1 and 3 appear, we get a TRUE.
- To check if two birthdays were the same, we simply use the any() and duplicated() functions like this:
any(duplicated(bdays))
[1] FALSE
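The result is TRUE when at least two people in the sample share a birthday and FALSE otherwise. In the run shown here the result is FALSE, so no two people in this particular sample had the same birthday.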
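For comparison, the same single-sample check can be sketched in Python using only the standard library. This is an illustrative analogue of sample(), duplicated(), and any(), not part of the R workflow the assignment asks for; the helper name has_shared_birthday is hypothetical, and the seed is the example value from the note at the top of the assignment.

```python
import random

def has_shared_birthday(bdays):
    """Return True if any value in bdays occurs more than once."""
    # A set drops repeats, so a shorter set means at least one duplicate.
    return len(set(bdays)) < len(bdays)

rng = random.Random(6145678)  # example seed only; use your own student ID
n = 30
# Draw n birthdays with replacement from days 1..365 (no February 29).
bdays = rng.choices(range(1, 366), k=n)

print(has_shared_birthday(bdays))
# The vector from the duplicated() example contains repeats:
print(has_shared_birthday([1, 2, 3, 1, 4, 3, 5]))  # True
```

Comparing the length of the set with the length of the list plays the role of any(duplicated(bdays)): it answers only whether a repeat exists, not where it occurs.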
To estimate the probability of a shared birthday in the group, repeat this experiment by sampling sets of 30 birthdays 10000 times, and find the relative frequency of the event that at least two people had the same birthday.
## code
# set.seed(your ID number)
1.2 Central Limit Theorem
Suppose random variables \(X_1, X_2, \dots, X_n\) are independent and follow a chi-squared distribution with 1 degree of freedom, \(\chi^2_{df=1}\).
- Use dchisq() to plot the density of the \(\chi^2_{df=1}\) distribution. Consider \(x\in (0, 5)\).
## code
- Consider three sample sizes \(n = 2, 8, 100\). For each \(n\), generate \(1000\) realizations of the sample mean \(\overline{X}_n\). Show that the sampling distribution of \(\overline{X}_n\), i.e., the collection \(\{\overline{X}_n^{(m)}\}_{m=1}^{1000}\), looks more and more Gaussian as \(n\) increases by making histograms of the \(\overline{X}_n\) samples for \(n = 2, 8, 100\). The procedure is the following:
For each \(n = 2, 8, 100\),
i. Draw \(n\) values \(x_1, x_2, \dots, x_n\) using rchisq(n, df = 1).
ii. Compute the mean of the \(n\) values, which is \(\overline{x}_n\).
iii. Repeat i. and ii. 1000 times to obtain 1000 values of \(\overline{x}_n\).
iv. Plot the histogram of these 1000 values of \(\overline{x}_n\).
## code
# set.seed(your ID number)
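The sampling procedure above can also be sketched in Python with the standard library alone. This is a rough illustration rather than the expected R solution: the function name sample_means and the seed are assumptions, chi-squared draws are generated by squaring standard normal draws (which matches a chi-squared distribution with 1 degree of freedom), and the histogram step is left out, so only summary statistics are printed.

```python
import random
import statistics

def sample_means(n, reps=1000, seed=6145678):
    """Return `reps` sample means, each computed from n chi-squared(1) draws."""
    rng = random.Random(seed)  # example seed only; use your own student ID
    means = []
    for _ in range(reps):
        # A chi-squared variable with 1 degree of freedom is a squared standard normal.
        draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(n)]
        means.append(statistics.fmean(draws))
    return means

for n in (2, 8, 100):
    means = sample_means(n)
    # Theory: the sample mean has expectation 1 and variance 2 / n.
    print(n, round(statistics.fmean(means), 3), round(statistics.variance(means), 4))
```

The printed variance should shrink roughly like \(2/n\), which is the narrowing that the histograms are meant to show alongside the increasingly symmetric shape.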
2 Machine Learning
2.1 Linear Regression
A pharmaceutical firm would like to obtain information on the relationship between the dose level and potency of a drug product. To do this, each of 15 test tubes is inoculated with a virus culture and incubated for 5 days at 30°C. Three test tubes are randomly assigned to each of the five different dose levels to be investigated (2, 4, 8, 16, and 32 mg). Each tube is injected with only one dose level, and the response of interest is obtained.
- Import dose.csv into your working session. The data set is not tidy. Use pivot_longer() to make it tidy, as in the tibble shown below. Call the tidy data set dose_tidy.
## code
## # A tibble: 15 × 3
##    dose_level tube  response
##         <dbl> <chr>    <dbl>
##  1          2 tube1        5
##  2          2 tube2        7
##  3          2 tube3        3
##  4          4 tube1       10
##  5          4 tube2       12
##  6          4 tube3       14
##  7          8 tube1       15
##  8          8 tube2       17
##  9          8 tube3       18
## 10         16 tube1       20
## 11         16 tube2       21
## 12         16 tube3       19
## 13         32 tube1       23
## 14         32 tube2       24
## 15         32 tube3       29
- Fit a simple linear regression of response on the predictor \(\texttt{dose level}\). Print the fitted result.
## code
- Using the simple linear regression on \(\texttt{dose level}\) fitted above, plot the data with a \(95\%\) confidence interval for the mean response.
## code
- Fit a simple linear regression model of response on the predictor \(\texttt{ln(dose level)}\), where \(\ln = \log_e\). Print the fitted result.
## code
- Using the simple linear regression on \(\texttt{ln(dose level)}\) fitted above, plot the data \((\ln(\text{dose level})_i, \text{response}_i), i = 1, \dots, 15\) with a \(95\%\) confidence interval for the mean response.
## code
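- Draw residual plots for the two models, the one with predictor \(\texttt{dose level}\) and the one with predictor \(\texttt{ln(dose level)}\). According to the plots, which model do you think is better?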
## code
- Import dose_tidy.csv and use Python to refit the simple linear regression of response on \(\texttt{dose level}\). Show the slope and intercept.
# code
- Use Python to predict the response value when the dose level is 10 and 30.
# code
2.2 Logistic Regression
- Import body.csv. Split the data into a training set and a test set using an 80:20 split. Set the random seed to your student ID number.
# code
# set.seed(your ID number)
- Fit a logistic regression with the predictor HEIGHT using the training sample data. Find the probability that the subject is male given HEIGHT = 165.
# code
- Fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.
# code
- Do the classification on the test set for the two models fitted above (the one using HEIGHT and the one using BMI), and compute the test accuracy rate for each. Which model gives the higher accuracy rate?
# code
- Use Python to split the body data into a training set and a test set.
## code
- Use Python to fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.
# code
- Use Python to do the classification on the test set. Compute the test accuracy rate.
# code
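The classification and accuracy steps can be illustrated with a small standard-library sketch. Everything numeric here is hypothetical: the coefficients, the test BMI values, and the labels are made up for illustration and are not taken from body.csv, and the coding 1 = male, 0 = female is an assumption. The sketch shows only the mechanics: convert the linear predictor to a probability, threshold at 0.5, and compare with the true labels.

```python
import math

def prob_male(bmi, intercept, slope):
    """Inverse logit: P(male | BMI) under a fitted logistic regression."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * bmi)))

# Hypothetical coefficients and test rows, for illustration only;
# real values come from fitting on the training split of body.csv.
intercept, slope = -2.0, 0.08
test_bmi = [19.5, 23.0, 26.0, 28.4, 31.2]
test_is_male = [0, 0, 1, 1, 0]  # assumed coding: 1 = male, 0 = female

# Classify as male when the predicted probability exceeds 0.5.
predicted = [int(prob_male(b, intercept, slope) > 0.5) for b in test_bmi]

# Test accuracy rate: share of test rows whose predicted label matches the truth.
accuracy = sum(p == y for p, y in zip(predicted, test_is_male)) / len(test_is_male)
print(predicted, accuracy)
```

The 0.5 cutoff is a common default, not a requirement; a different threshold trades off the two kinds of misclassification.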
2.3 K-Nearest Neighbors (KNN)
- Fit KNN classifiers with \(K=1\) and \(K=10\) using BMI on the training data, and do the classification on the same test set used in logistic regression. Obtain the confusion matrix for each value of \(K\). Which \(K\) performs better? Why?
# code
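To make the KNN mechanics concrete, here is a minimal one-predictor sketch in Python using only the standard library. The data are hypothetical toy values, not rows from body.csv, the coding 1 = male, 0 = female is an assumption, and the function names knn_predict and confusion_matrix are made up for this example. Because the toy training set has only seven points, it compares \(K=1\) with \(K=3\) instead of \(K=10\).

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x (1-D distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def confusion_matrix(actual, predicted):
    """Counts keyed by (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

# Hypothetical toy BMI values and labels (1 = male, 0 = female), for illustration only.
train_bmi = [18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
train_sex = [0, 0, 1, 0, 1, 1, 1]
test_bmi = [21.5, 23.4, 29.0]
test_sex = [0, 0, 1]

for k in (1, 3):
    predicted = [knn_predict(train_bmi, train_sex, x, k) for x in test_bmi]
    print(k, dict(confusion_matrix(test_sex, predicted)))
```

With \(K=1\) each prediction copies the label of a single nearest neighbor, so it is more sensitive to individual noisy points than a vote over a larger neighborhood; which \(K\) performs better on real data has to be judged from the test-set confusion matrices.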