Intro to Data Science INFO180

                          

INFO180 Problem Set: Questions and answers (200pt)

 

October 16, 2023


Instructions

This is your first problem set in this course. Your task is to use a dataset and try to answer a few questions using those data. You should describe the data, explain your analysis, and discuss your findings. The 200 “grading” points of this PS are equal to 20 points on your final grading scale. Note that this is group work, and for the first PS you will be a member of a random group (PS1 group on Canvas). Each group only needs a single submission!

Submit your writing and the dataset (or a link to it), and include screenshots of any code you end up writing. Submit your writing in a format that Canvas can display. If in doubt, a PDF will always do. Check the separate file ps01-example.pdf to see what a solution might look like. The existing reading materials are, admittedly, thin. There is some information, though, in my INFO 180 notes, including

• data integrity (Ch 3.4)

• preliminary data analysis (Ch 6)

• missing values (6.2), variable quality (6.3)

• questions and how relevant are the questions (8.1)

• sampling (8.4)

• how to write a report (9)

 

1 Descriptive analysis (50pt)

Choose a dataset. First, let’s do some (non-statistical) description of the dataset. Answer all these questions as well as you can. Just “yes” and “no” answers are not enough: you have to justify why you think the sample is biased, or why you think it is trustworthy.

1. (5pt) What is this dataset about?

2. (5pt) Where did you get the dataset?

3. (10pt) Who collected it?

4. (15pt) What can you say about the sampling? What is the sample and what is the population you want to analyze here? How exactly was the data collected? Do you think the sample is somehow biased or not?

5. (15pt) How trustworthy is the dataset?

 

2 Data and Research Questions (55pt)

Next we take a look at your research questions and how well the data align with them.

 

1. (15pt) List the questions you want to answer using the dataset. There should be at least four questions. The questions should be reasonable, related to these data, and something that you expect you might be able to answer with these data.

It is OK if you later discover that 1-2 questions cannot be answered using these data (or that the questions themselves are too vague). Just explain why this is the case.

2. (10pt) Load your data and explore what variables (columns) there are in the dataset. Explain what the variables mean.

Note: you only need to explain variables that you think will help you to answer questions. You may ignore age if your questions are solely about gender, or ignore geographic location if you are only interested in age.

3. (20pt) How are the variables of interest coded? Are they numeric or categorical? If numeric, do the values fall into a reasonable range? If categorical, what are the categories?

4. (10pt) How many missing values do you see in the variables of interest? Are there missing values that are coded differently than the system NA?
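
The checks in items 2-4 can be sketched in a few lines of code. The following is only a minimal illustration in Python using the standard library, not a required approach: the toy data, the column names (`age`, `gender`, `income`), and the list of missing-value codes are all assumptions made up for this example. Replace them with your own file and with the codes documented for your dataset.

```python
import csv
import io

# Hypothetical toy data standing in for your own file; the column names,
# values, and sentinel codes below are assumptions for illustration only.
RAW = """age,gender,income
34,F,52000
-99,M,48000
51,,NA
29,F,61000
unknown,M,39000
"""

# Codes that may mean "missing" besides an empty cell.
# This list is an assumption: check your dataset's documentation.
MISSING_CODES = {"", "NA", "N/A", "-99", "unknown"}


def load_rows(text):
    """Read CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def is_number(value):
    """Return True if the string can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarize(rows, column):
    """Describe one column: missing count, type, and range or categories."""
    values = [row[column] for row in rows]
    present = [v for v in values if v not in MISSING_CODES]
    summary = {"missing": len(values) - len(present)}
    if present and all(is_number(v) for v in present):
        numbers = [float(v) for v in present]
        summary.update(kind="numeric", min=min(numbers), max=max(numbers))
    else:
        summary.update(kind="categorical", categories=sorted(set(present)))
    return summary


rows = load_rows(RAW)
for column in rows[0].keys():
    print(column, summarize(rows, column))
```

Note that a code such as -99 parses as a valid number, so it would silently distort the range of `age` if it were not listed as a missing-value code first. This is why it is worth looking at the actual values before trusting the minimum and maximum.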

 

3 Answer your questions (95pt)

1. (4 × 20pt) Next, answer your questions based on data.

• If you need additional data (e.g. population figures), explain where you get those (e.g. from the census), and how reliable you find them.

• If you cannot answer a question, then explain why: is the question too unclear, or is your data not suited for answering it? Be specific!

2. (15pt) Discuss the limitations of your answers, data, and analysis. Normally the answers do not answer everything. Why? What kind of data would you need to get better answers? Would you need better analysis methods?
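
One common reason to bring in additional data such as population figures is to turn raw counts into rates, so that large and small groups can be compared fairly. The sketch below shows the arithmetic only. The region names, the counts, and the population numbers are entirely hypothetical, and `rate_per_100k` is a helper name invented for this example.

```python
# Hypothetical counts from a dataset and population figures from an outside
# source; all region names and numbers are made up for illustration.
incident_counts = {"Region A": 120, "Region B": 45}
population = {"Region A": 600_000, "Region B": 90_000}


def rate_per_100k(counts, pop):
    """Return events per 100,000 residents for regions present in both inputs."""
    return {
        region: counts[region] / pop[region] * 100_000
        for region in counts
        if region in pop and pop[region] > 0
    }


rates = rate_per_100k(incident_counts, population)
for region, rate in sorted(rates.items()):
    print(f"{region}: {rate:.1f} per 100,000")
```

In this made-up example Region B has fewer events than Region A but a higher rate per resident, so the answer to a question can change depending on whether you report counts or rates. Say which one you use, and where the population figures come from.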

 

How much time did you spend?

And finally-finally, tell us how much time (how many hours) you spent on this PS!