Need help understanding an RMarkdown document

Midterm Instructions

This RMarkdown document is the entirety of the midterm exam. Like the problem sets in this class, you will be working with actual social science data and using the statistical and conceptual tools you’ve learned so far this term to answer social science questions. Please submit the midterm exam through the Canvas midterm submission portal by Monday, November 4th at 8pm. Late submissions can be submitted up to 24 hours late for a 10% reduction. Other key rules for midterm:

• Submit a compiled PDF, not an .rmd document. If you are having trouble compiling/submitting your PDF, please email Prof. Waight before the deadline.

• Prof. Waight will be available over email to answer clarifying questions only. If Prof. Waight deems a question important enough to share with the entire class, she will send out a class announcement over email during the midterm exam. Please check your email to make sure you don’t miss these.

• Absolutely no collaboration with other students is allowed during the midterm. This means no discussing approaches, no sharing or comparing code/answers. We also do not allow consultation with Data Services. As with all class assignments we do not allow the use of generative AI or consulting with “friends on the internet.”

• Please edit the header of this document to include your name. Also rename the compiled document with your name. We will take points off if you don’t do this.

• We will take points off generally for any coding errors or poor coding conventions (for example not putting breaks in the code, so that it runs off the page). Please answer all text-based questions in full sentences. When answering interpretation-based questions, please make sure to include all elements we referenced in lecture:

– Experiments: state assumption, justify assumption, state treatment in substantive language (i.e. don’t just say the name of the variable in the dataset, describe what the variable represents), state outcome in substantive language, indicate direction and size of causal effect, use appropriate unit of measurement, use causal language.

– Descriptive (all other interpretations): state variable(s) in substantive language, include estimate(s), use appropriate units of measurement, indicate the sample the data is coming from.

• For all visualizations, axis and title labels should be substantively interpretable (i.e. provide a substantive label for what a given variable represents, don’t just include the name of the variable).

• Make sure to include all code in codeblocks and all text answers outside of codeblocks. We have included the RMarkdown guide as a reference in this Posit Cloud project.

Part 1: Eviction, Debt, and Perceptions of Neighborhood Safety

As argued by sociologist Matthew Desmond and others, eviction has profound consequences for American life. In the first part of this problem set we’re going to use the Future of Families dataset to investigate the connection between experiencing eviction, household income and debt, and perceptions of neighborhood safety. The data we’re working with for this part of the problem set is included in the debt.rds dataframe. This is the same set of families we’ve been working with thus far this term, just a new set of variables from the same study. Familiarize yourself with the new variables by reading the “midterm_overview_debt_eviction_data.pdf” overview.


Question 1: Reading in the Data (10 pts)

In the code block below read in the debt.rds dataframe and the ggplot2 library. Also remove any observations with missing values using na.omit(). You need this code to work to finish the rest of the problem set so if you are having any trouble here please reach out to Prof. Waight for help (this is the only part of the problem set she will help you with, but we will take points off if you get help).

library(ggplot2)

debt_data <- readRDS("debt.rds")
debt_data <- na.omit(debt_data)

Question 2: Getting Acquainted with the Data (10 pts, 2.5 pts each)

Let’s get acquainted with these new variables from the Future of Families survey. Read the intro pdf (midterm_overview_debt_eviction_data.pdf) and then fill in the following bulleted list to indicate for a selection of variables their type (categorical or quantitative), whether they are discrete or continuous, their variable scale (nominal, ordinal, interval, or ratio), and whether they are binary or non-binary. We’ve filled it in for the first variable so you can get the idea. You don’t need to include full sentences, just add in the correct label.

• pcg.hhinc.15
  – type: quantitative
  – discrete or continuous: discrete
  – scale: ratio
  – binary or non-binary: non-binary

• c.neigh gangs.15
  – type: categorical
  – discrete or continuous: discrete
  – scale: ordinal
  – binary or non-binary: non-binary

• pcg.net_worth.15
  – type: quantitative
  – discrete or continuous: continuous
  – scale: ratio
  – binary or non-binary: non-binary

• pcg.eviction.15
  – type: categorical
  – discrete or continuous: discrete
  – scale: nominal
  – binary or non-binary: binary

• pcg.housing.15
  – type: categorical
  – discrete or continuous: discrete
  – scale: nominal
  – binary or non-binary: non-binary

Question 3: Examining Debt Distribution (10 pts)

We’re going to start by examining the distribution of the primary caregiver’s total debt when the focal child was 15. This question has two parts, each worth five points:

• In the code block below calculate the mean and median of the total debt variable. When calculating the mean, do not use the mean function. You can use the mean() function to check your work, but you must provide code to calculate the mean without the function.


• Below the code block discuss whether the mean is approximately equal to, greater than, or less than the median. Discuss what this implies about the distribution of the total debt variable. Is it approximately symmetric, left skewed, or right skewed?

median(debt_data$pcg.total_debt.15)

## [1] 9500
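The "mean without mean()" step can be illustrated on made-up numbers. This is only a sketch: the vector `toy_debt` is hypothetical toy data, not the `pcg.total_debt.15` column, and the built-in `mean()` is used purely as a check.

```r
# Toy debt values in dollars (made up for illustration, not from debt.rds)
toy_debt <- c(0, 500, 2000, 9500, 60000)

# Mean by hand: sum of the values divided by the number of values
manual_mean <- sum(toy_debt) / length(toy_debt)

# Built-in versions, used only to check the manual calculation
check_mean <- mean(toy_debt)
toy_median <- median(toy_debt)

# TRUE here: a mean well above the median is a common sign of right skew
manual_mean > toy_median
```

In this toy vector a single large value pulls the mean far above the median, which is the pattern the second bullet asks about.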

Question 4: Visualizing the Debt Distribution (10 pts)

1. Pt 1 (5 pts):

In the code block below visualize the debt distribution with a density plot. Add appropriate axis labels, a title, and a theme. We also want you to add a line to the density plot by adding the following layer to your ggplot: + geom_vline(mapping = aes(xintercept = 20000), lty = 2). The geom_vline() code draws a vertical line on the plot at the x-axis value of 20,000. We have included in the appendix an image of what your graph should look like (scroll to the bottom of the compiled pdf of this doc to see it). You need to produce the code to reproduce the graph.

2. Pt 2 (5 pts):

23.3% of respondents reported a total debt greater than or equal to 20,000. In text below the code block, answer the following question: what is the estimated total area under the density curve to the left of the vertical line (drawn at 20,000)?
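The reasoning behind Pt 2 is that the total area under a density curve is 1, so the area to the left of a cutoff is the complement of the share at or above it. A small sketch on hypothetical toy values (not the exam data) shows the arithmetic; because a density curve is a smoothed estimate, the match with sample shares is approximate.

```r
# Toy debt values in dollars (made up for illustration)
toy_debt <- c(0, 1000, 5000, 8000, 12000,
              15000, 19000, 20000, 45000, 90000)
threshold <- 20000

# Share of observations at or above the cutoff (3 of 10 here)
share_at_or_above <- mean(toy_debt >= threshold)

# The whole area is 1, so the area left of the cutoff is the complement
area_left <- 1 - share_at_or_above
area_left
```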

Question 5: Spread of Debt Variable (10 pts)

Let’s look at how much spread there is for the debt variable. This question has two parts:

1. Calculate the standard deviation of the debt variable for two different groups of respondents: families where the primary caregiver experienced eviction in the six years prior to the survey and families where the primary caregiver did not. We will give you the estimates here, you need to produce the code to calculate these estimates (sd evicted: approximately 29081, sd not evicted: approximately 101408) (5 pts)

2. Interpret the standard deviations. Your interpretation should include appropriate units of measurement and analysis, substantive descriptions of variables, and a discussion of what any differences between the two standard deviations imply. (5 pts)
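One way to compute a standard deviation separately for each group is `tapply()`. The sketch below uses a hypothetical toy data frame with made-up column names (`debt`, `evicted`); the real column names and codings in debt.rds may differ.

```r
# Toy data frame (made up; column names are assumptions)
toy <- data.frame(
  debt    = c(1000, 3000, 5000, 0, 20000, 100000),
  evicted = c("yes", "yes", "yes", "no", "no", "no")
)

# Apply sd() to the debt values within each eviction group
sd_by_group <- tapply(toy$debt, toy$evicted, sd)
sd_by_group
```

A larger standard deviation for one group means that group's debt values are spread further from their own group mean, in dollars.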

Question 6: Debt, Eviction, and Perceptions of Neighborhood Safety

We’re going to use visualizations to explore the relationship between eviction, focal child’s perceptions of neighborhood safety during the evening, and primary caregiver total debt. This question has four parts:

1. Create a new variable in the debt dataset called “debt_binary” which indicates whether the family has any debt (i.e. debt is greater than 0). (2.5 pts)

2. Create a visualization where the main variable on the x-axis is focal child perception of neighborhood safety at night. Using facet_wrap() break out this visualization in terms of whether the primary caregiver experienced eviction or not. Within facet_wrap() include the following argument: scales = "free". Your facet_wrap should look like this: facet_wrap(~ pcg.eviction.15, scales = "free"). This argument creates different axis scales for the two facets, which is helpful in cases like this where there are many more people who did not experience eviction than did. In your visualization make sure to include appropriate axis labels and make sure your axis labels have appropriate ordering. We have included an example of what this plot should look like in the appendix, you just need to reproduce the code to create the plot. It’s okay if your axis labels overlap. (2.5 pts)

3. Create a second visualization where the main variable on the x-axis is your new binary debt variable. Again using facet_wrap break out this visualization by whether the primary caregiver experienced eviction or not. Still use the scales = "free" argument within the facet_wrap code. We have again included an example of what this plot should look like in the appendix, you just need to reproduce the code to create the plot. (2.5 pts)

4. In text below the code block, interpret your two visualizations in terms of the relationship between the focal child’s perceptions of neighborhood safety and eviction, and family debt and eviction. You don’t need to include point estimates here, just substantive interpretations of what the plots show. (2.5 pts)
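The pieces described in this question (a binary indicator, an ordered categorical axis, and free-scaled facets) can be sketched together on toy data. Everything here is hypothetical: the data frame, the column names `debt`, `evicted`, and `unsafe_night`, and the answer labels are made up and may not match debt.rds.

```r
library(ggplot2)

# Toy data (made up; names and answer labels are assumptions)
toy <- data.frame(
  debt = c(0, 2500, 0, 40000, 900, 0),
  evicted = c("No", "No", "No", "Yes", "Yes", "Yes"),
  unsafe_night = c("Strongly disagree", "Somewhat agree",
                   "Somewhat disagree", "Strongly agree",
                   "Somewhat agree", "Strongly disagree")
)

# Binary indicator: TRUE when the family has any debt
toy$debt_binary <- toy$debt > 0

# Explicit level order so the x-axis is not sorted alphabetically
toy$unsafe_night <- factor(
  toy$unsafe_night,
  levels = c("Strongly disagree", "Somewhat disagree",
             "Somewhat agree", "Strongly agree")
)

# Bar chart of answers, one panel per eviction group, each with its own scale
safety_plot <- ggplot(toy, aes(x = unsafe_night)) +
  geom_bar() +
  facet_wrap(~ evicted, scales = "free") +
  labs(x = "Child feels unsafe in neighborhood at night",
       y = "Number of children",
       title = "Perceived safety by eviction experience (toy data)") +
  theme_minimal()
```

Swapping `unsafe_night` for `debt_binary` in `aes()` gives the second kind of plot the question describes.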

Question 7: Eviction and Perceptions of Neighborhood Safety (10 pts)

We’re going to explore the relationship between experiences of eviction and perceptions of neighborhood safety a bit more.

1. In the code block below, create a table of proportions which calculates the proportion of focal children who gave different answers to the perceptions of neighborhood safety at night variable by whether their family experienced eviction. We have included an image of what the table should look like in the appendix, it’s your job to reproduce the code. (5 points)

2. In text below the code block, answer the following question. Among children who experienced eviction, what percent of children either “somewhat agreed” or “strongly agreed” that they felt unsafe at night? What was this estimate for children whose families did not experience eviction? (5 points)
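A table of proportions within groups is commonly built with `table()` followed by `prop.table()`. The sketch below uses hypothetical toy data and simplified answer labels; whether rows or columns should sum to 1 in the exam's appendix table is not stated here, so `margin = 2` is an assumption.

```r
# Toy data (made up; names and labels are assumptions)
toy <- data.frame(
  evicted = c("No", "No", "No", "No", "Yes", "Yes"),
  unsafe_night = c("Agree", "Disagree", "Disagree",
                   "Disagree", "Agree", "Disagree")
)

# Counts: answers in rows, eviction groups in columns
counts <- table(toy$unsafe_night, toy$evicted)

# margin = 2 divides each column by its column total,
# so each eviction group's proportions sum to 1
props <- prop.table(counts, margin = 2)
props
```

To combine two answer categories within a group, add the corresponding cells of that group's column and multiply by 100 for a percent.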

Question 8: Family Household Income and Debt

For this question we’re going to explore the relationship between family household income and debt in the Future of Families dataset. This question has two parts:

1. Create a scatterplot with debt on the y-axis and household income on the x-axis. Include appropriate axis labels, a title, and a theme. Avoid overplotting by setting alpha = .5 within geom_point().

2. Calculate the correlation between household income and debt. Do you observe a positive, negative, or zero association? Interpret this pattern in substantive terms: do families with greater household income have more debt? Report in text below the code block.
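The correlation step can be sketched with `cor()` on two numeric vectors. The values below are hypothetical toy numbers, not the exam data, so the sign and size of the result say nothing about the real dataset.

```r
# Toy values in dollars (made up for illustration)
toy_income <- c(20000, 35000, 50000, 80000, 120000)
toy_debt   <- c(1000, 0, 9000, 15000, 60000)

# Pearson correlation: ranges from -1 to 1; the sign gives the direction
r <- cor(toy_income, toy_debt)

# Classify the direction of the association
direction <- if (r > 0) "positive" else if (r < 0) "negative" else "zero"
direction
```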

Midterm Part 2: The Mark of a Criminal Record

In the second part of the midterm (last two questions) we’re going to draw on data from Devah Pager’s experiment: “The Mark of a Criminal Record.” Pager was interested in the effects of mass incarceration on American society: how interactions with the criminal justice system impacted individuals’ ability to get a job. To answer her research question Pager ran an audit experiment where she had real individuals apply to jobs with identical resumes in the Milwaukee area. She randomized whether the person indicated they had a criminal record by including a parole officer as a reference for a random subset of the resumes. Mass incarceration has disproportionately impacted Black populations in the United States. Pager’s second research question was whether Black applicants who have been incarcerated experience greater discrimination in their job search than formerly incarcerated White applicants. To test this second question Pager randomized the race of the applicant by using both Black and White research assistants posing as job applicants. She randomized whether the otherwise identical applications were delivered by a Black versus White “job applicant.”

We have the actual data from Pager’s experiment in the applications.csv file. The “criminal” variable indicates whether the resume included a parole officer as a reference. The “race” variable indicates whether the research assistant posing as a job applicant was White or Black. The “call” variable indicates whether the applicant received a call back from a potential employer (1 if yes, 0 if no).

Question 9: Getting to Know the Data

This question has two parts:


1. In the code block below, read in the applications.csv file. If you are having trouble doing this please contact Prof. Waight for help. We will take off points for receiving help but will help you write this code so you can do the rest of the midterm. (2 points)

2. Below the code block, discuss which of the “criminal,” “race”, and “call” variables are the treatment variable(s), and which are the outcome variable(s). For the treatment variable(s), discuss which is the treatment group and which is the control group. Use substantive language when answering this question (i.e. explain what the variables represent). (8 points)

Question 10: Experiment Results

This question has two parts.

1. In the first code block below, calculate the difference in means for the call back variable by whether the application signalled a criminal record or not (you should get an estimate of -.126, you need to write code to produce this estimate). In text below the first code block, interpret the results of this calculation. Make sure to follow the interpretation conventions outlined in the header of this document and discussed in lecture.

2. In the second code block below, subset the applications dataset to just applications which had a criminal record. Calculate the difference in means for the call back variable by whether the application was delivered by a Black versus White research assistant (you should get an estimate of -.116, you need to write code to produce this estimate). In text below the second code block, interpret the results of this calculation.
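A difference in means is the mean outcome in one group minus the mean outcome in the other. The sketch below uses a hypothetical toy data frame; how `criminal` and `race` are actually coded in applications.csv (for example "yes"/"no" or 1/0) is an assumption here, and the toy results are unrelated to the estimates quoted above.

```r
# Toy data (made up; the codings of criminal and race are assumptions)
toy <- data.frame(
  criminal = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  race     = c("black", "black", "white", "white",
               "black", "black", "white", "white"),
  call     = c(0, 0, 1, 0, 0, 1, 1, 1)
)

# Call-back rate with a record minus call-back rate without one
diff_record <- mean(toy$call[toy$criminal == "yes"]) -
  mean(toy$call[toy$criminal == "no"])

# Keep only applications that signalled a criminal record
with_record <- subset(toy, criminal == "yes")

# Within that subset: Black applicants' rate minus White applicants' rate
diff_race <- mean(with_record$call[with_record$race == "black"]) -
  mean(with_record$call[with_record$race == "white"])
```

Because `call` is coded 1 or 0, each group mean is a call-back proportion, so a difference can be read in percentage points after multiplying by 100.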

Appendix Plots

1. Question 4 Debt Density Plot

2. Question 6 Plots


3. Question 7 Table
