One picture is worth a thousand words. Thatâ€™s why visualizing data and model results is a crutial skill for any data scientist. {sjPlot} package became my favorite tool for visualization. Thatâ€™s why I want to share with you some simple but very effective commands which will make you more productive today. So, letâ€™s visualize Wage dataset, visualize bunch of models and see what people earn and what factors determine the salary.
I recommend to watch a video first, because I highlight things I talk about. Itâ€™s less then 9 minutes long.
view_df()
, plot_frq()
, save_plot()
, plot_grpfrq()
, plot_grid()
, plot_xtab()
, tab_xtab()
, plot_gpt()
, plot_likert()
, plot_model()
, tab_model()
, plot_models()
view_df
View dataframe (view_df
) function with only 4 arguments, (1) your data, (2) show frequencies, (3) show percentages and (4) show missing values, displays a range of numeric variables and the counts + percentages of missing values and categorical variables, giving you a nice big picture of your data.
view_df(Wage, show.frq = T, show.prc = T, show.na = T)
ID  Name  Label  missings  Values  Value Labels  Freq.  % 

1  year  0 (0.00%)  range: 20032009  
2  age  0 (0.00%)  range: 1880  
3  maritl  0 (0.00%) 

648 2074 19 204 55 
21.60 69.13 0.63 6.80 1.83 

4  race  0 (0.00%) 

2480 293 190 37 
82.67 9.77 6.33 1.23 

5  education  0 (0.00%) 

268 971 650 685 426 
8.93 32.37 21.67 22.83 14.20 

6  region  0 (0.00%) 

0 3000 0 0 0 0 0 0 0 
0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 

7  jobclass  0 (0.00%) 

1544 1456 
51.47 48.53 

8  health  0 (0.00%) 

858 2142 
28.60 71.40 

9  health_ins  0 (0.00%) 

2083 917 
69.43 30.57 

10  logwage  0 (0.00%)  range: 3.05.8  
11  wage  0 (0.00%)  range: 20.1318.3 
plot_frq
, plot_grpfrq
and plot_grid
However, we often want to see an actual picture, for example, display frequencies and percentages of categorical variables on a bar plot. For that {sjPlot} package provides a convenient plotfrequencies (plot_frq
) function, which does just that. For instance, plotting education shows that around 9% of people in our data did not finish a high school, while around 14% have a PhD.
Since {sjPlot} package works with tidyverse ðŸ¥³, we can easily group the data by any other categorical variable, letâ€™s take race, and get frequencies and percentages for every group. plot_grid()
function puts several subplots in a single plot and even names the subplots. For instance, a subplot C shows that most of AfroAmericans in our dataset ARE highly educated. And of coarse you can save this publicationready plot with â€¦ surprise surpriiise â€¦ save_plot
command.
save_plot(filename = "race_vs_education.jpg", fig = p, width = 30, height = 19)
quartz_off_screen
2
While seeing counts and percentages of separate groups is cool, we sometimes want to put groups directly near each other. And thatâ€™s exactly what plotgroupedfrequencies (plot_grpfrq
) function does. For instance, it clearly shows that most of the people with lower education degrees work in factories, while folks with higher education degrees work with information.
plot_grpfrq(
var.cnt = Wage$education,
var.grp = Wage$jobclass)
This IS already useful, however, plot_xtab
function goes one step further and displays percentages of jobclasses inside of every educational degree as stackedbars, where counts are identical to the previous plot, but every educational category as one 100 percent. Such display only reinforces our hypothesis that highly educated folks usually work in the IT and shows a clear association between jobclass and education. As if that were not enough, plot_xtab
even tests this hypothesis with the ChiSquared test of association and displays a significant pvalue and a large effect size.
# as stacked proportional bars
plot_xtab(
x = Wage$education,
grp = Wage$jobclass,
margin = "row",
bar.pos = "stack",
show.summary = TRUE,
coord.flip = TRUE)
So, plot_xtab
essentially visualizes cross tables, also known as pivot tables. And if for some reason you want an actual table with the results of a statistical test, you can use tab_xtab
function instead.
tab_xtab(
var.row = Wage$education,
var.col = Wage$jobclass,
show.row.prc = T)
education  jobclass  Total  





190 70.9Â % 
78 29.1Â % 
268 100Â % 

636 65.5Â % 
335 34.5Â % 
971 100Â % 

342 52.6Â % 
308 47.4Â % 
650 100Â % 

274 40Â % 
411 60Â % 
685 100Â % 

102 23.9Â % 
324 76.1Â % 
426 100Â % 
Total 
1544 51.5Â % 
1456 48.5Â % 
3000 100Â % 
Ï‡^{2}=282.643 Â· df=4 Â· Cramerâ€™s V=0.307 Â· p=0.000 
(not part of the video) By the way, we can decide what kind of percentages are calculated, rows or columns or even single cells, whether we want to stack the bars and many more. It will automatically conduct Fisherâ€™s test of association if samples are small (<5).
The pvalues are based on chisq.test of x and y for each grp.
But enough about counting, since our dataset is about salaries, letâ€™s figure our who earns more, industrial or information workers? Plot frequencies function, which we used for counting, can also easily answer this question if we give it a (1) wage
variable, (2) tell it to plot a histogram
, (3) to show an average with standard deviation and (4) to display a normal curve to see whether our salaries are normally distributed. This visualization reveals that industrial workers get 103 thousand dollars on average, while IT crowd gets 17 thousand more.
The last thing Iâ€™d like to share with you before we visualize models, is a visualization of likert scale data. If you have same scales or categories across different variables, plot_likert
function nicely compares percentages of scales or categories across those variables.
data(pisaitems)
d < pisaitems %>%
dplyr::select(starts_with("ST25Q"))
view_df(d, show.frq = T, show.prc = T)
ID  Name  Label  Values  Value Labels  Freq.  % 

1  ST25Q01 
Never or almost never A few times a year About once a month Several times a month Several times a week 
6682 13143 13995 20353 11436 
10.18 20.03 21.33 31.02 17.43 

2  ST25Q02 
Never or almost never A few times a year About once a month Several times a month Several times a week 
24019 16789 10317 9489 4751 
36.75 25.68 15.78 14.52 7.27 

3  ST25Q03 
Never or almost never A few times a year About once a month Several times a month Several times a week 
11131 16164 12818 14569 10658 
17.04 24.74 19.62 22.30 16.31 

4  ST25Q04 
Never or almost never A few times a year About once a month Several times a month Several times a week 
20145 19961 12768 8797 3622 
30.85 30.57 19.55 13.47 5.55 

5  ST25Q05 
Never or almost never A few times a year About once a month Several times a month Several times a week 
12317 12141 10314 15645 15165 
18.78 18.51 15.73 23.86 23.12 
plot_likert(d)
Visualizing data is quite, but letâ€™s get to the really cool visualization stuff!
plot_model
function is the actual reason I love {sjPlot} package. I literally use it everyday!
For example if I want to know how education influences salary, Iâ€™ll plot predictions from a simple linear model. Plotting prediction immediately tells me the story. Namely, people who did not even finish a high school, have the lowest salary compared to any other education level. Moreover, we can see that increasing education level means increasing salaries. So, education matters!
m < lm(wage ~ education, data = Wage)
plot_model(m, type = "pred")
$education
The only thing we canâ€™t see from this plot is whether this increase is significant or not. We could use a well known summary
table for that, but the output, although useful, is not really pleasing to the human eye and is not suitable for publication.
summary(m)
Call:
lm(formula = wage ~ education, data = Wage)
Residuals:
Min 1Q Median 3Q Max
112.31 19.94 3.09 15.33 222.56
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 84.104 2.231 37.695 < 2e16 ***
education2. HS Grad 11.679 2.520 4.634 3.74e06 ***
education3. Some College 23.651 2.652 8.920 < 2e16 ***
education4. College Grad 40.323 2.632 15.322 < 2e16 ***
education5. Advanced Degree 66.813 2.848 23.462 < 2e16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.53 on 2995 degrees of freedom
Multiple Rsquared: 0.2348, Adjusted Rsquared: 0.2338
Fstatistic: 229.8 on 4 and 2995 DF, pvalue: < 2.2e16
Luckily for us, plot_model()
with the argument show.values = TRUE
transforms a boring summary table into this informative picture, which shows the increase in salary in thousands of dollars as compared to no education (Intercept, not shown) with 95% confidence intervals and significance stars which indicate that those increases in salary are significant.
(not in the video) Where vertical 0 indicates no effect (xaxis position 1 for most glmâ€™s and position 0 for most linear models)
plot_model(m, show.values = TRUE, width = 0.1)+
ylab("Increase in salary as compared to no education")
However, sometimes we still need to report the summary table, but we need to make it look better. And thatâ€™s where tab_model
command comes into play. Within tab_model
we can show the reference level, hide the intercept and change the style of pvalues.
tab_model(m,
show.reflvl = T,
show.intercept = F,
p.style = "numeric_stars")
Â  wage  

Predictors  Estimates  CI  p 

Reference  

11.68 ^{***}  6.74Â â€“Â 16.62  <0.001 

23.65 ^{***}  18.45Â â€“Â 28.85  <0.001 

40.32 ^{***}  35.16Â â€“Â 45.48  <0.001 

66.81 ^{***}  61.23Â â€“Â 72.40  <0.001 
Observations  3000  
R^{2} / R^{2} adjusted  0.235 / 0.234  

But the most amazing thing about plot_model
and tab_model
functions is that they work with almost any type of model you can imagine! I successfully used it for mixedeffects models, Bayesian models or negativebinomial models, to name a few. And the authors of the package constantly improve the functionality, so that, at the moment you read this blogpost, {sjPlot} package is most likely improved.
Here is an example of how ease we can visualize a very fancy model, namely a generalized linear mixedeffects regression for negativebinomial distribution of age with 3 way interaction term and a random effect of education.
m.nb < glmer.nb(age ~ wage * jobclass * health + (1education), data = Wage)
plot_model(m.nb, type = "int")[[4]]
This interactions show that industrial workers with a very good health earn 50 thousand dollars already at the age of 31, while IT crowd gets the same salary ca. 8 years later, however at the age of 45 the IT crowd catches on and even starts to slowly overtake the factory workers, and finally, while IT folks get to the salary of 300 thousand dollars already at the age of 50, factory workers might reach this kind of wealth only at the end of their carrier, at around 63 years old. And the nonoverlapping confidence intervals indicate that such difference in salaries is significant.
Besides, you can easily change the appearance of your results by changing the default order of predictors and even choose particular values from the numeric predictor. For instance, letâ€™s take three salary values 50, 150 & 300 as we just talked about them and display our results in a different way.
plot_model(m.nb, type = "pred", terms = c("health", "jobclass", "wage [50, 150, 300]"))
Moreover, type
argument allows to create various plot types. For example, we can easily visualize random effects if we want to.
plot_model(m.nb,
type = "re",
width = .5,
show.values = T) + ylim(0.9,1.1)
It only gets better from now. If we want to explore several dependent variables with the same predictors, we can use plot_models
function to plot several models at once. In the first code example weâ€™ll use already familiar argument  show.values
 and a new one  grid
 which plots models in separate fields to avoid congestion and overload of information on the picture.
# fit two models
fit1 < lm(age ~ education + jobclass + health_ins, data = Wage)
fit2 < lm(wage ~ education + jobclass + health_ins, data = Wage)
# plot multiple models
plot_models(fit1, fit2, show.values = T, grid = TRUE)
In the second example we avoid clutter by simply not using show.values
, since we can kind of read them from the xaxes, and weâ€™ll use p.shape = TRUE
argument instead, in order to display pvalues as shapes instead of significance stars.
plot_models(fit1, fit2, p.shape = TRUE)
tab_model
can also easily display multiple models. Here, collapse.ci = TRUE
argument conveniently puts confidence intervals below the estimates, so that we can report several models near each other.
tab_model(fit1, fit2,
collapse.ci = TRUE,
p.style = "numeric_stars")
Â  age  wage  

Predictors  Estimates  p  Estimates  p 
(Intercept) 
43.15 ^{***} (41.68Â â€“Â 44.63) 
<0.001 
93.72 ^{***} (89.14Â â€“Â 98.31) 
<0.001 
education [2. HS Grad] 
0.21 ^{} (1.75Â â€“Â 1.34) 
0.794 
8.15 ^{***} (3.34Â â€“Â 12.95) 
0.001 
education [3. Some College] 
2.01 ^{*} (3.65Â â€“Â 0.37) 
0.016 
17.89 ^{***} (12.79Â â€“Â 22.99) 
<0.001 
education [4. College Grad] 
0.48 ^{} (2.13Â â€“Â 1.17) 
0.568 
33.03 ^{***} (27.91Â â€“Â 38.15) 
<0.001 
education [5. Advanced Degree] 
1.35 ^{} (0.45Â â€“Â 3.16) 
0.142 
57.90 ^{***} (52.28Â â€“Â 63.52) 
<0.001 
jobclass [2. Information] 
1.42 ^{**} (0.56Â â€“Â 2.28) 
0.001 
3.67 ^{**} (1.00Â â€“Â 6.35) 
0.007 
health ins [2. No] 
3.29 ^{***} (4.20Â â€“Â 2.39) 
<0.001 
19.89 ^{***} (22.72Â â€“Â 17.07) 
<0.001 
Observations  3000  3000  
R^{2} / R^{2} adjusted  0.033 / 0.031  0.284 / 0.283  

By the way, if you want to visualize and test ALL the assumptions of ANY model with a SINGLE function, check out this video about another amazing package created by the same author  Daniel LÃ¼decke.
If you think, I missed something, please comment on it, and Iâ€™ll improve this tutorial.
Thank you for learning!
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/yuryzablotski/yuzaRBlog, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zablotski (2023, Aug. 15). yuzaRBlog: R package reviews {sjPlot} How to Easily Visualize Data And Model Results. Retrieved from https://yuzarblog.netlify.app/posts/20220801sjplot/
BibTeX citation
@misc{zablotski2023r, author = {Zablotski, Yury}, title = {yuzaRBlog: R package reviews {sjPlot} How to Easily Visualize Data And Model Results}, url = {https://yuzarblog.netlify.app/posts/20220801sjplot/}, year = {2023} }