Homework 3: Probability, Statistics and Machine Learning

Spring 2024 MATH/COSC 3570 Introduction to Data Science by Dr. Cheng-Han Yu

Author

Insert Your Name!!

Published

May 2, 2024

1 Probability and Statistics

1.1 Monte Carlo Simulation

  1. Suppose you are in a classroom with 30 people. If we assume this is a randomly selected group of 30 people, what is the chance that at least two people have the same birthday? Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29.
  1. Note that birthdays can be represented as numbers between 1 and 365, so a sample of 30 birthdays can be obtained like this:
n <- 30
bdays <- sample(x = 1:365, size = n, replace = TRUE)
  1. To check if in this particular set of 30 people we have at least two with the same birthday, we can use the function duplicated(), which returns TRUE whenever an element of a vector is a duplicate. Here is an example:
duplicated(c(1, 2, 3, 1, 4, 3, 5))
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

The second time 1 and 3 appear, we get a TRUE.

  1. To check if two birthdays were the same, we simply use the any() and duplicated() functions like this:
any(duplicated(bdays))
[1] FALSE

In this case, we see that it did happen. At least two people had the same birthday.

To estimate the probability of a shared birthday in the group, repeat this experiment by sampling sets of 30 birthdays 10000 times, and find the relative frequency of the event that at least two people had the same birthday.

## code
# set.seed(your ID number)

1.2 Central Limit Theorem

Suppose random variables \(X_1, X_2, \dots, X_n\) are independent and follow Chi-squared distribution with degrees of freedom 1, \(\chi^2_{df=1}\).

  1. Use dchisq() to plot \(\chi^2_{df=1}\) distribution. Consider \(x\in (0, 5)\).
## code
  1. Consider three sample sizes \(n = 2, 8, 100\), and set the sample size of the sample mean \(\overline{X}_n\) be \(1000\). Show the sampling distribution of \(\overline{X}_n\), i.e., the collection \(\{\overline{X}_n^{(m)}\}_{m=1}^{1000}\), looks more and more like Gaussian as \(n\) increases by making histograms of \(\overline{X}_n\) samples with \(n = 2, 8, 100\). The procedure is the following:

For each \(n = 2, 8, 100\),

  1. Draw \(n\) values \(x_1, x_2, \dots, x_n\) using rchisq(n, df = 1).
  2. Compute the mean of the \(n\) values, which is \(\overline{x}_n\).
  3. Repeat i. and ii. 1000 times to obtain 1000 \(\overline{x}_n\)s.
  4. Plot the histogram of these 1000 \(\overline{x}_n\)s.
## code
# set.seed(your ID number)

2 Machine Learning

2.1 Linear Regression

A pharmaceutical firm would like to obtain information on the relationship between the dose level and potency of a drug product. To do this, each of 15 test tubes is inoculated with a virus culture and incubated for 5 days at 30°C. Three test tubes are randomly assigned to each of the five different dose levels to be investigated (2, 4, 8, 16, and 32 mg). Each tube is injected with only one dose level, and the response of interest is obtained.

  1. Import dose.csv into your working session. The data set is not tidy. Use pivot_longer() to make it tidy as the shown tibble below. Call the tidy data set dose_tidy.
## code

## # A tibble: 15 × 3
##    dose_level tube  response
##         <dbl> <chr>    <dbl>
##  1          2 tube1        5
##  2          2 tube2        7
##  3          2 tube3        3
##  4          4 tube1       10
##  5          4 tube2       12
##  6          4 tube3       14
##  7          8 tube1       15
##  8          8 tube2       17
##  9          8 tube3       18
## 10         16 tube1       20
## 11         16 tube2       21
## 12         16 tube3       19
## 13         32 tube1       23
## 14         32 tube2       24
## 15         32 tube3       29
  1. Fit a simple linear regression with the predictor \(\texttt{dose level}\) for response. Print the fitted result.
## code
  1. With (2), plot the data with a \(95\%\) confidence interval for the mean response.
## code
  1. Fit a simple linear regression model with the predictor \(\texttt{ln(dose level)}\) for response, where \(\ln = \log_e\). Print the fitted result.
## code
  1. With (4), plot the data \((\ln(\text{dose level})_i, \text{response}_i), i = 1, \dots, 15\) with a \(95\%\) confidence interval for the mean response.
## code
  1. Draw residual plots of Model in (2) and (4). According to the plots, which model you think is better?
## code
  1. Import dose_tidy.csv and redo (2) using Python. Show the slope and intercept.

# code
  1. Use Python to predict the response value when the dose level is 10 and 30.

# code

2.2 Logistic Regression

  1. Import body.csv. Split the data into a training set and a test set. Set the random seed at your student ID number. Use 80:20 rule.
# code
# set.seed(your ID number)
  1. Fit a logistic regression with the predictor HEIGHT using the training sample data. Find the probability that the subject is male given HEIGHT = 165.
# code
  1. Fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.
# code
  1. Do the classification on the test set for the model (2) and (3), and compute the test accuracy rate. Which model gives us higher accuracy rate?
# code
  1. Use Python to split the body data into a training set and a test set.

## code
  1. Use Python to fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.

# code
  1. Use Python to do the classification on the test set. Compute the test accuracy rate.

# code

2.3 K-Nearest Neighbors (KNN)

  1. Fit the KNN with \(K=1\) and \(10\) using BMI on the training data and do the classification on the same test set used in logistic regression. Obtain the confusion matrix for the two \(K\)s. Which \(K\) performs better? Why?
# code