Data Analysis AMS315

Data Analysis, Spring 2024
First Computing Assignment
One Predictor Linear Regression
Introduction
This assignment is due on Thursday, April 4, and the report is worth 100 points. A second
project is coming, so you should finish the first project as soon as possible. Please submit your
report for Project 1 (both parts) as a single PDF file on the class Brightspace site. Each student
has one chance to resubmit the report before the deadline. Detailed submission information is
given below.
Project 1 has two parts and three data files: two files are for Part A, and one
is for Part B. The files are labeled with the last six digits of your Stony Brook ID number.
Part A
Part A is worth 40 points. Part A is modeled on a first data-processing and statistical
task that a newly hired statistician might be given. Your report should address the issues
that your future supervisor would want to know about: the number of observations, the
fraction of missing data in the independent variable and in the dependent variable, and the
imputation of missing data.
The two files for Part A each contain a column for subject ID and a column for either the
dependent variable value or the independent variable value. Your first task is to sort the two files
by subject ID and merge them. You should not just use “cut and paste” to merge your data.
You are also expected to account for missing data. Your report should contain four counts:
the number of subject IDs that had an independent variable value, the number of subject IDs
that had a dependent variable value, the number of subject IDs that had both an independent
and a dependent variable value, and the number of subject IDs that had at least one
independent variable value or dependent variable value.
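The merge and the four counts can be sketched in a few lines of code. The example below is a minimal illustration in Python using only the standard library, not a required approach. The toy rows, the variable names, and the use of None for a missing value are all assumptions made for illustration; your actual files may mark missing values differently.

```python
# Hypothetical toy data standing in for the two Part A files.
# Each row is (subject ID, value); None marks a missing value (an assumption).
iv_rows = [(3, 1.2), (1, 0.5), (4, None), (2, 2.0)]  # independent variable file
dv_rows = [(2, 4.1), (5, 9.9), (1, None), (3, 2.7)]  # dependent variable file


def merge_by_id(iv_rows, dv_rows):
    """Outer-join the two files on subject ID and sort the result by ID."""
    iv = dict(iv_rows)
    dv = dict(dv_rows)
    ids = sorted(set(iv) | set(dv))
    # An ID absent from one file gets None for that variable.
    return [(i, iv.get(i), dv.get(i)) for i in ids]


def count_observed(merged):
    """Count subject IDs by which of the two values are present."""
    return {
        "has_iv": sum(x is not None for _, x, _ in merged),
        "has_dv": sum(y is not None for _, _, y in merged),
        "has_both": sum(x is not None and y is not None for _, x, y in merged),
        "has_at_least_one": sum(x is not None or y is not None for _, x, y in merged),
    }


merged = merge_by_id(iv_rows, dv_rows)
counts = count_observed(merged)

# Fraction of subject IDs missing each variable, out of all IDs seen in either file.
frac_missing_iv = 1 - counts["has_iv"] / len(merged)
frac_missing_dv = 1 - counts["has_dv"] / len(merged)
```

An outer join is used here so that subject IDs appearing in only one file are kept and counted as missing the other value, rather than silently dropped.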
Your second task is to impute the missing values. There are many procedures for handling
missing data, and statistical packages often include imputation algorithms. For example, R
has a package called mice (Multivariate Imputation by Chained Equations) that offers several
options. You may not choose listwise deletion or mean imputation (or its equivalent, median
imputation). Specify your choice in your report. Often, the choice of imputation method has
little effect on the results if the fraction of missing data is 30% or less.
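To make the idea of imputation concrete, here is a minimal sketch of one possible method, stochastic regression imputation. It is written in Python with the standard library only. It is not the algorithm used by the R package mentioned above, and it is not a recommendation of which method to choose. The function names, the toy data, and the use of None for a missing value are assumptions made for illustration. The sketch fills in a missing dependent value from an observed independent value and leaves rows with a missing independent value unchanged.

```python
import random


def fit_line(pairs):
    """Return the least-squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope


def impute_y(rows, seed=0):
    """Fill missing y from observed x by stochastic regression imputation.

    Needs at least three complete cases so the residual variance is defined.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    complete = [(x, y) for x, y in rows if x is not None and y is not None]
    b0, b1 = fit_line(complete)
    # Residual standard deviation from the complete cases (n - 2 degrees of freedom).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in complete)
    sigma = (sse / (len(complete) - 2)) ** 0.5
    out = []
    for x, y in rows:
        if y is None and x is not None:
            # Predicted value plus random noise, so imputed points are not
            # placed exactly on the fitted line.
            y = b0 + b1 * x + rng.gauss(0.0, sigma)
        out.append((x, y))
    return out


# Hypothetical (x, y) rows; None marks a missing value.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8), (None, 6.0)]
imputed = impute_y(rows)
```

The random noise term is what separates this from plain regression imputation, which would place every imputed point exactly on the fitted line and so understate the variability in the data.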