Data Science Overview 📖

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

What is Data Science or a Data Scientist?

A Little History of Data Science

Source: https://www.reddit.com/r/meme/comments/floq3q/reality_behind_data_science/

Source: https://br.ifunny.co/picture/we-will-work-together-statistics-computer-science-please-teach-now-h4hdtthT9?s=cl

😕 Still, what on earth is data science?

Battle of the Data Science Venn Diagrams

Shall We Continue?

  • You probably get the idea. There are so many ways to define data science.

Nobody Knows What Data Science (or a Data Scientist) Is

What Wiki Defines

Data Science in This Course

  • Data science is a discipline that allows us to turn raw data into understanding, insight, and knowledge.

  • We're going to learn to do this in a tidy way. More on that later!

  • This is an introductory data science course with an emphasis on important tools in R/Python that help us do data science.

A Data Science Project

Data Science Workflow

  • Import: Take data stored somewhere and load it into your workspace.
  • Tidy: Store data in a consistent rectangular form, i.e., a data matrix.
  • Transform: Narrow in on observations of interest, create new variables, and calculate summary statistics.
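The three steps above can be sketched on a toy example. This is a minimal sketch, not the course's own code: the quiz data, the column names, and the cutoff of 10 points per quiz are all made up for illustration, and the inline CSV text stands in for a real file path.

```r
library(readr)   # import
library(tidyr)   # tidy
library(dplyr)   # transform

# Import: parse a small, hypothetical CSV (I() marks the string as literal data).
raw <- read_csv(I("student,quiz1,quiz2
A,8,9
B,6,7
C,10,9"), show_col_types = FALSE)

# Tidy: reshape so each row is one student-quiz pair and each column one variable.
scores <- raw |>
  pivot_longer(cols = c(quiz1, quiz2), names_to = "quiz", values_to = "score")

# Transform: keep the observations of interest, create a new variable,
# and calculate a summary statistic per student.
summary_tbl <- scores |>
  filter(student != "B") |>
  mutate(pct = score / 10 * 100) |>   # assumes each quiz is out of 10
  group_by(student) |>
  summarise(mean_pct = mean(pct))

print(summary_tbl)
```

The wide table has one column per quiz, which is convenient for data entry but awkward for grouping; after `pivot_longer()` the same data has one column per variable, so `group_by()` and `summarise()` apply directly.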

Data Matrix

  • Each row corresponds to a unique case or observational unit.
  • Each column represents a characteristic or variable.
  • This structure allows new cases to be added as rows or new variables as new columns.
ggplot2::mpg |> print(n = 10)
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
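To illustrate how a data matrix grows, here is a small sketch that adds a new case as a row and a new variable as a column. The three-column table and the "corolla" row are hypothetical values chosen for illustration, not rows taken from `mpg`; the km/L conversion factor (1 mpg ≈ 0.425144 km/L) is the standard US-gallon figure.

```r
library(dplyr)

# A tiny data matrix: each row is one car, each column one variable.
cars <- tibble(
  model = c("a4", "civic"),
  displ = c(1.8, 1.6),
  hwy   = c(29, 33)
)

# A new case becomes a new row.
more_cases <- bind_rows(
  cars,
  tibble(model = "corolla", displ = 1.8, hwy = 35)   # hypothetical car
)

# A new variable becomes a new column (highway mileage in km per litre).
more_vars <- more_cases |>
  mutate(hwy_kpl = hwy * 0.425144)

print(more_vars)
```

Because every row has the same columns, neither operation disturbs the existing entries: the table only gets longer or wider.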

  • Visualisation: A good visualisation shows you things that you did not expect or raises new questions about the data.

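library(ggplot2)  # provides ggplot(), the geoms, and the mpg data

# Highway mileage against engine displacement, coloured by vehicle class
mpg |> ggplot(aes(x = displ, y = hwy)) +
    geom_point(aes(color = class)) +   # one point per car
    geom_smooth() +                    # smoothed trend with default settings
    theme_bw()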

  • Model: Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them.

library(tidymodels)
linear_reg() |>  
    set_engine("lm") |> 
    fit(hwy ~ displ, data = mpg)
parsnip model object


Call:
stats::lm(formula = hwy ~ displ, data = data)

Coefficients:
(Intercept)        displ  
      35.70        -3.53  
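As a sketch of how a fitted model answers a precise question, the example below asks: what highway mileage does the line predict for a 3-litre engine? It calls base R's `lm()` directly (the same engine the tidymodels call above delegates to) so that it runs without tidymodels installed; the 3-litre value and the names `new_car` and `pred` are chosen for illustration.

```r
# Fit the same simple linear regression with base R.
fit <- lm(hwy ~ displ, data = ggplot2::mpg)

# A hypothetical new car with a 3-litre engine.
new_car <- data.frame(displ = 3)

# Predicted highway mpg from the fitted line: intercept + slope * 3.
pred <- unname(predict(fit, newdata = new_car))
print(pred)
```

With the rounded coefficients shown above, the same calculation by hand is 35.70 - 3.53 × 3 ≈ 25.1 miles per gallon.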

  • Communication: It doesn't matter how well your models and visualisations have led you to understand the data unless you can also communicate your results to others.

  • Programming: Surrounding all these tools is programming.

R for Data Science

Source: https://teachdatascience.com/tidyverse/

Python for Data Science

Source: https://www.e2enetworks.com/blog/9-python-libraries-for-data-science-and-artificial-intelligence