Custom R Regression Functions You Might Find Useful
Background
I recently finished one of the most labor-intensive projects I’ve ever worked on. The project involved a series of lab-in-the-field surveys at multiple timepoints, dozens of different datasets, and hundreds of variables, many with missing or incomplete data.
When I started out the project I was completely new to R. Needless to say, I suffered a very steep learning curve.
But over time I was able to develop some functions that made my life a whole lot easier. I thought I would describe a few of these in a post so that others might not have to suffer the same head-bashing-against-the-wall issues that I did.
In this post, I will introduce you to a family of functions in R I’ve created that allow you to:
- Test a series of regression models, efficiently controlling for the same variables in each model.
- Quickly copy-paste results into a manuscript.
- Easily identify missing data.
- Produce publication-ready regression tables for presentations or supplementary materials.
The Main Function
The primary function is called my.lm(). All associated code is available in the associated Github repository.
The function has a number of features, which I’ll explain in turn using the “mtcars” dataset.
Feature 1: Specify whether you want the output of the function to appear in the RStudio Console, as a PDF (via R Markdown: Knit), or both.
Generally, I find there are two phases of statistical research. The first is when you are running various statistical tests trying to figure out what is going on with your data. The second is when you want to print your results to a PDF file via R Markdown, where you can share your results with others or even submit them for publication.
The function my.lm() takes the arguments “Console” and “PDF” as Boolean variables and produces results according to whether these are set to TRUE or FALSE.
For now, let’s imagine that we want to just print our results to the console. So before we proceed with the function, we will set:
Console = T
PDF = F
Feature 2: Store and reuse control variables.
The my.lm() function allows you to set a list of variables as controls and reuse them repeatedly in your analysis.
Let’s imagine, for example, that we want to test the relationship between the number of cylinders a car has (“cyl”) and its miles per gallon (“mpg”), controlling for the number of forward gears weight (“wt”) and horsepower (“hp”).
So first, we set weight and horsepower as controls like so:
mtcars.controls <- c("wt", "hp")
Now we’re ready to run the regression. The my.lm() function takes arguments in the following form:
my.lm(data = mtcars,
dv = "mpg",
predictor = "cyl",
controls = mtcars.controls)
When I run this function, my output looks like this:
Let’s talk a little bit about what you’re looking at.
As you can see, the first part of the function is the basic model summary. This includes all coefficients in the model as well as their associated statistics.
Then, included beneath the summary (on the bottom line) is a manuscript-ready statistical output of the relationship between the primary predictor and the dependent variable. This is produced by the function my.lm.output() which is called by my.lm().
The statistical parameters are surrounded by underscores, which you can easily convert to italics using advanced find-and-replace in Word.
By default, my.lm() only prints the manuscript-ready coefficient of the primary predictor. But you can specify a different variable if you want using the “row.name” argument.
Here, for example, I’ve decided I want the MS-ready output of the “weight” predictor:
my.lm(mtcars,
dv = "mpg",
predictor = "cyl",
controls = mtcars.controls,
row.name = "wt")
One thing I think is helpful about this function is it allows you to store a set of control variables and reuse them multiple times. So suppose we now wanted to look at the relationship between number of carburetors (“carb”) and miles per gallon, controlling for the same variables. All we’d do is replace “cyl” with “carb” in the function and we’d be good to go:
my.lm(mtcars,
dv = "mpg",
predictor = "carb",
controls = mtcars.controls)
This starts to become really handy when you’re dealing with a dozen control variables and want to make sure you aren’t missing any in your models. It also takes up a lot less space in your code and ensures that you can easily spot differences and similarities between the models you’ve specified.
Feature 3: Adding/removing control variables
Another built-in aspect of the function is the ability to add or remove control variables. So suppose we wanted to run the exact same regression, but this time we wanted to additionally control for number of cylinders AND number of carburetors. The “add.controls” argument in the function allows you to add additional controls:
my.lm(mtcars,
dv = "mpg",
predictor = "carb",
controls = mtcars.controls,
add.controls = "cyl")
Similarly, “rm.controls” allows you to run the model while removing one of the control variables. Here I’ve removed horsepower:
my.lm(mtcars,
dv = "mpg",
predictor = "carb",
controls = mtcars.controls,
rm.controls = "hp")
Feature 4: Including Interactions
Interactions are also really easy to run in my.lm(). All you have to do is add the interaction term as a string variable to the “predictor” argument, then in the “row.name” argument, make sure you input the variable as it would appear in the model summary.
Here is an example of how you would add an interaction term to the my.lm() function:
my.lm(mtcars,
dv = "mpg",
predictor = "carb*gear",
controls = mtcars.controls,
row.name = "carb:gear")
And the resulting output:
Checking Missing Data
One thing I noticed when I was performing my analysis is that my degrees of freedom were varying substantially from one analysis to the next. This is because I was working with a messy dataset in which participants had skipped certain questions or the variables differed from one dataset to the next.
It became critical for me to see which variables were contributing to my missing data so that I could make a decision about what to do about it.
Fortunately, the “mice” package has a very handy function that allows you to identify missing data from a model. I used this to create a very simple function called my.missing.data.model() that takes a model as input and returns a summary of the missing data in that model.
The nice thing about the my.lm() function is that it returns the model invisibly. This means that you can perform pipe-like operations on the model once you’ve called it. The following set of functions runs the original regression model and then tests it for missing data:
my.lm(mtcars,
dv = "mpg",
predictor = "cyl",
controls = mtcars.controls) %>%
my.missing.data.model()
Obviously here we have a complete dataset, but if you were dealing with missing data, the output would indicate which variables utilized in the regression equation were incomplete.
Knitting to PDF
Now let’s imagine that we’re ready to knit the regression tables to a PDF in order to share with a collaborator or publish in a Supplementary Online section of your manuscript.
In the past when I’ve done this, I’ve been forced to copy the analytic code over again and import to a regression output handler like Stargazer in R Markdown.
But with my.lm(), you can quickly set
Console = F
PDF = T
at the beginning of your code, press “knit,” and your document will output to a PDF with beautiful tables. The reason is that my.lm() is set to call the function my.print.lm(), which converts the output to apa table format via the papaja package.
Note: I’m deeply indebted to my collaborator Matthias Forstmann for developing much of this code. With his permission I’ve uploaded an example R Markdown file so that you can use the settings provided.
Here is an example of the output PDF:
Multilevel Models
I’ve also developed a very similar function called my.lmer() that allows you to specify random effects in the model. I’ll show some examples of how that works here.
For illustration purposes, I’ll use a publicly available dataset that describes student achievement on a standardized math test alongside a variety of other demographic variables.
As before, we can use the my.lmer() function to specify regression models and check their summaries. Let’s first examine whether time spent on homework each week (“homework”) predicts math scores (“math”), controlling for student socioeconomic status (“ses”), sex (“sex”), and the teacher/student ratio at the school (“ratio”). Let’s also specify school (“schid”) as a random effect in this model to adjust for between-school differences.
math_scores <- readRDS("example data.Rda")math.controls <- c("sex", "ratio", "ses")Console = T
PDF = Fmy.lmer(math_scores,
dv = "math",
predictor = "homework",
rdm.effect = "schid",
controls = math.controls)
As you can see, we’ve added the “rdm.effect” argument in the model to specify that school ID is a random effect. The rest of the model is the same, and we end up with the following output:
Again, the best part of this is that you can see the full model summary, and the primary predictor variable is included in a manuscript-ready statistical format at the bottom of the output.
We can add or remove controls just like before, and again, the model is returned invisibly, allowing you to perform any operations you like on it while maintaining the summary output in the console.
Missing Data
Again, we can use the my.missing.data.model() function to check missing data.
my.lmer(math_scores,
dv = "math",
predictor = "homework",
rdm.effect = "schid",
controls = math.controls) %>%
my.missing.data.model()
Knitting to PDF
Once again, if you set
Console = F
PDF = T
before you knit the R Markdown file, you’ll end up with pretty-looking regression tables that look like this:
Conclusion
There you have it! A set of functions that facilitate repeated regression models in R, easy copy-paste of statistical reports, and PDF-generation.
Complete code for running both the my.lm() and my.lmer() examples are available here and here, respectively.
As I said, you can download the entire set of functions from the github repository. Start by opening the .Rproj file and proceed from there to ensure the working directory is set up correctly.
If you have any questions or issues running this code, please let me know. This is my first blog post on a code, so I’d be eager to hear any ways the code, or my communication of it, could be improved.
Happy coding!