Homework 3: Probability, Statistics and Machine Learning

Spring 2024 MATH/COSC 3570 Introduction to Data Science by Dr. Cheng-Han Yu

Author

Insert Your Name!!

Published

May 2, 2024

Note: For any simulation or random sampling, set the random seed at your student ID number, for example set.seed(6145678).

1 Probability and Statistics

1.1 Monte Carlo Simulation

Suppose you are in a classroom with 30 people. If we assume this is a randomly selected group of 30 people, what is the chance that at least two people have the same birthday? Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29.

Note that birthdays can be represented as numbers between 1 and 365, so a sample of 30 birthdays can be obtained like this:

n <- 30
bdays <- sample(x = 1:365, size = n, replace = TRUE)

To check if in this particular set of 30 people we have at least two with the same birthday, we can use the function duplicated(), which returns TRUE whenever an element of a vector is a duplicate. Here is an example:

duplicated(c(1, 2, 3, 1, 4, 3, 5))

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

The second time 1 and 3 appear, we get a TRUE.

To check if two birthdays were the same, we simply use the any() and duplicated() functions like this:

any(duplicated(bdays))

[1] FALSE

In this case, we see that it did happen. At least two people had the same birthday.

To estimate the probability of a shared birthday in the group, repeat this experiment by sampling sets of 30 birthdays 10000 times, and find the relative frequency of the event that at least two people had the same birthday.

## code
# set.seed(your ID number)

1.2 Central Limit Theorem

Suppose random variables $X_1, X_2, \dots, X_n$ are independent and follow Chi-squared distribution with degrees of freedom 1, $\chi^2_{df=1}$.

Use dchisq() to plot $\chi^2_{df=1}$ distribution. Consider $x\in (0, 5)$.

## code

Consider three sample sizes $n = 2, 8, 100$, and set the sample size of the sample mean $\overline{X}_n$ be $1000$. Show the sampling distribution of $\overline{X}_n$, i.e., the collection $\{\overline{X}_n^{(m)}\}_{m=1}^{1000}$, looks more and more like Gaussian as $n$ increases by making histograms of $\overline{X}_n$ samples with $n = 2, 8, 100$. The procedure is the following:

For each $n = 2, 8, 100$,

Draw $n$ values $x_1, x_2, \dots, x_n$ using rchisq(n, df = 1).
Compute the mean of the $n$ values, which is $\overline{x}_n$.
Repeat i. and ii. 1000 times to obtain 1000 $\overline{x}_n$s.
Plot the histogram of these 1000 $\overline{x}_n$s.

## code
# set.seed(your ID number)

2 Machine Learning

2.1 Linear Regression

A pharmaceutical firm would like to obtain information on the relationship between the dose level and potency of a drug product. To do this, each of 15 test tubes is inoculated with a virus culture and incubated for 5 days at 30°C. Three test tubes are randomly assigned to each of the five different dose levels to be investigated (2, 4, 8, 16, and 32 mg). Each tube is injected with only one dose level, and the response of interest is obtained.

Import dose.csv into your working session. The data set is not tidy. Use pivot_longer() to make it tidy as the shown tibble below. Call the tidy data set dose_tidy.

## code

## # A tibble: 15 × 3
##    dose_level tube  response
##         <dbl> <chr>    <dbl>
##  1          2 tube1        5
##  2          2 tube2        7
##  3          2 tube3        3
##  4          4 tube1       10
##  5          4 tube2       12
##  6          4 tube3       14
##  7          8 tube1       15
##  8          8 tube2       17
##  9          8 tube3       18
## 10         16 tube1       20
## 11         16 tube2       21
## 12         16 tube3       19
## 13         32 tube1       23
## 14         32 tube2       24
## 15         32 tube3       29

Fit a simple linear regression with the predictor $\texttt{dose level}$ for response. Print the fitted result.

## code

With (2), plot the data with a $95\%$ confidence interval for the mean response.

## code

Fit a simple linear regression model with the predictor $\texttt{ln(dose level)}$ for response, where $\ln = \log_e$. Print the fitted result.

## code

With (4), plot the data $(\ln(\text{dose level})_i, \text{response}_i), i = 1, \dots, 15$ with a $95\%$ confidence interval for the mean response.

## code

Draw residual plots of Model in (2) and (4). According to the plots, which model you think is better?

## code

Import dose_tidy.csv and redo (2) using Python. Show the slope and intercept.


# code

Use Python to predict the response value when the dose level is 10 and 30.


# code

2.2 Logistic Regression

Import body.csv. Split the data into a training set and a test set. Set the random seed at your student ID number. Use 80:20 rule.

# code
# set.seed(your ID number)

Fit a logistic regression with the predictor HEIGHT using the training sample data. Find the probability that the subject is male given HEIGHT = 165.

# code

Fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.

# code

Do the classification on the test set for the model (2) and (3), and compute the test accuracy rate. Which model gives us higher accuracy rate?

# code

Use Python to split the body data into a training set and a test set.


## code

Use Python to fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.


# code

Use Python to do the classification on the test set. Compute the test accuracy rate.


# code

2.3 K-Nearest Neighbors (KNN)

Fit the KNN with $K=1$ and $10$ using BMI on the training data and do the classification on the same test set used in logistic regression. Obtain the confusion matrix for the two $K$s. Which $K$ performs better? Why?

# code

--- title: "Homework 3: Probability, Statistics and Machine Learning" subtitle: "Spring 2024 MATH/COSC 3570 Introduction to Data Science by Dr. Cheng-Han Yu" format: html: code-fold: false code-tools: true date: today author: "**Insert Your Name!!**" number-sections: true from: markdown+emoji editor: source --- ```{r} #| label: setup #| include: false #################################################### ## !!!DO NOT MAKE ANY CHANGE OF THIS CODE CHUNK!!!## #################################################### # Package names packages <- c("knitr", "ggplot2", "ggrepel", "tidyverse", "formatR", "dslabs", "janitor", "ggthemes", "plotly", "tidymodels", "kknn") # Install packages not yet installed installed_packages <- packages %in% rownames(installed.packages()) if (any(installed_packages == FALSE)) { install.packages(packages[!installed_packages]) } # Packages loading invisible(lapply(packages, library, character.only = TRUE)) ``` - **Note: For any simulation or random sampling, set the random seed at your student ID number, for example `set.seed(6145678)`.** # Probability and Statistics ## Monte Carlo Simulation            1. Suppose you are in a classroom with 30 people. If we assume this is a randomly selected group of 30 people, what is the chance that at least two people have the same birthday? Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29. ```{=html}  ``` i. Note that birthdays can be represented as numbers between 1 and 365, so a sample of 30 birthdays can be obtained like this: ```{r} n <- 30 bdays <- sample(x = 1:365, size = n, replace = TRUE) ``` ii. To check if in this particular set of 30 people we have at least two with the same birthday, we can use the function `duplicated()`, which returns `TRUE` whenever an element of a vector is a duplicate. Here is an example: ```{r} duplicated(c(1, 2, 3, 1, 4, 3, 5)) ``` The second time 1 and 3 appear, we get a `TRUE`. iii. To check if two birthdays were the same, we simply use the `any()` and `duplicated()` functions like this: ```{r} any(duplicated(bdays)) ``` In this case, we see that it did happen. At least two people had the same birthday. To estimate the probability of a shared birthday in the group, repeat this experiment by sampling sets of 30 birthdays 10000 times, and find the relative frequency of the event that at least two people had the same birthday. ```{r} ## code # set.seed(your ID number) ``` ## Central Limit Theorem Suppose random variables $X_1, X_2, \dots, X_n$ are independent and follow Chi-squared distribution with degrees of freedom 1, $\chi^2_{df=1}$. 1. Use `dchisq()` to plot $\chi^2_{df=1}$ distribution. Consider $x\in (0, 5)$. ```{r} ## code ``` 2. Consider three sample sizes $n = 2, 8, 100$, and set the sample size of the sample mean $\overline{X}_n$ be $1000$. Show the sampling distribution of $\overline{X}_n$, i.e., the collection $\{\overline{X}_n^{(m)}\}_{m=1}^{1000}$, looks more and more like Gaussian as $n$ increases by making histograms of $\overline{X}_n$ samples with $n = 2, 8, 100$. The procedure is the following: For each $n = 2, 8, 100$, i. Draw $n$ values $x_1, x_2, \dots, x_n$ using `rchisq(n, df = 1)`. ii. Compute the mean of the $n$ values, which is $\overline{x}_n$. iii. Repeat i. and ii. 1000 times to obtain 1000 $\overline{x}_n$s. iv. Plot the histogram of these 1000 $\overline{x}_n$s. ```{r} ## code # set.seed(your ID number) ``` # Machine Learning ## Linear Regression A pharmaceutical firm would like to obtain information on the relationship between the dose level and potency of a drug product. To do this, each of 15 test tubes is inoculated with a virus culture and incubated for 5 days at 30°C. Three test tubes are randomly assigned to each of the five different dose levels to be investigated (2, 4, 8, 16, and 32 mg). Each tube is injected with only one dose level, and the response of interest is obtained. 1. Import [`dose.csv`](./dose.csv) into your working session. The data set is not tidy. Use `pivot_longer()` to make it tidy as the shown tibble below. Call the tidy data set `dose_tidy`. ```{r} ## code ## # A tibble: 15 × 3 ## dose_level tube response ## <dbl> <chr> <dbl> ## 1 2 tube1 5 ## 2 2 tube2 7 ## 3 2 tube3 3 ## 4 4 tube1 10 ## 5 4 tube2 12 ## 6 4 tube3 14 ## 7 8 tube1 15 ## 8 8 tube2 17 ## 9 8 tube3 18 ## 10 16 tube1 20 ## 11 16 tube2 21 ## 12 16 tube3 19 ## 13 32 tube1 23 ## 14 32 tube2 24 ## 15 32 tube3 29 ``` 2. Fit a simple linear regression with the predictor $\texttt{dose level}$ for `response`. Print the fitted result. ```{r} ## code ``` 3. With (2), plot the data with a $95\%$ confidence interval for the mean response. ```{r} ## code ``` 4. Fit a simple linear regression model with the predictor $\texttt{ln(dose level)}$ for `response`, where $\ln = \log_e$. Print the fitted result. ```{r} ## code ``` 5. With (4), plot the data $(\ln(\text{dose level})_i, \text{response}_i), i = 1, \dots, 15$ with a $95\%$ confidence interval for the mean response. ```{r} ## code ``` 6. Draw residual plots of Model in (2) and (4). According to the plots, which model you think is better? ```{r} ## code ``` 7. Import [`dose_tidy.csv`](./dose_tidy.csv) and redo (2) using **Python**. Show the slope and intercept. ```{python} # code ``` 8. Use **Python** to predict the response value when the dose level is 10 and 30. ```{python} # code ``` ## Logistic Regression 1. Import [`body.csv`](./body.csv). Split the data into a training set and a test set. Set the random seed at your student ID number. Use 80:20 rule. ```{r} # code # set.seed(your ID number) ``` 2. Fit a logistic regression with the predictor `HEIGHT` using the training sample data. Find the probability that the subject is male given `HEIGHT = 165`. ```{r} # code ``` 3. Fit a logistic regression with the predictor `BMI` using the training sample data. Find the probability that the subject is male given `BMI = 25`. ```{r} # code ``` 4. Do the classification on the test set for the model (2) and (3), and compute the test accuracy rate. Which model gives us higher accuracy rate? ```{r} # code ``` 5. Use **Python** to split the `body` data into a training set and a test set. ```{python} ## code ``` 6. Use **Python** to fit a logistic regression with the predictor `BMI` using the training sample data. Find the probability that the subject is male given `BMI = 25`. ```{python} # code ``` 7. Use **Python** to do the classification on the test set. Compute the test accuracy rate. ```{python} # code ``` ## K-Nearest Neighbors (KNN) 1. Fit the KNN with $K=1$ and $10$ using `BMI` on the training data and do the classification on the same test set used in logistic regression. Obtain the confusion matrix for the two $K$s. Which $K$ performs better? Why? ```{r} # code ```