One picture is worth a thousand words. Thatâ€™s why visualizing data and model results is a crutial skill for any data scientist. {sjPlot} package became my favorite tool for visualization. Thatâ€™s why I want to share with you some simple but very effective commands which will make you more productive today. So, letâ€™s visualize Wage dataset, visualize bunch of models and see what people earn and what factors determine the salary.
I recommend to watch a video first, because I highlight things I talk about. Itâ€™s less then 9 minutes long.
view_df()
, plot_frq()
,
save_plot()
, plot_grpfrq()
,
plot_grid()
, plot_xtab()
,
tab_xtab()
, plot_gpt()
,
plot_likert()
, plot_model()
,
tab_model()
, plot_models()
view_df
View dataframe (view_df
) function with only 4
arguments, (1) your data, (2) show frequencies, (3) show percentages and
(4) show missing values, displays a range of numeric variables and the
counts + percentages of missing values and categorical variables, giving
you a nice big picture of your data.
view_df(Wage, show.frq = T, show.prc = T, show.na = T)
ID  Name  Label  missings  Values  Value Labels  Freq.  % 

1  year  0 (0.00%)  range: 20032009  
2  age  0 (0.00%)  range: 1880  
3  maritl  0 (0.00%) 

648 2074 19 204 55 
21.60 69.13 0.63 6.80 1.83 

4  race  0 (0.00%) 

2480 293 190 37 
82.67 9.77 6.33 1.23 

5  education  0 (0.00%) 

268 971 650 685 426 
8.93 32.37 21.67 22.83 14.20 

6  region  0 (0.00%) 

0 3000 0 0 0 0 0 0 0 
0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 

7  jobclass  0 (0.00%) 

1544 1456 
51.47 48.53 

8  health  0 (0.00%) 

858 2142 
28.60 71.40 

9  health_ins  0 (0.00%) 

2083 917 
69.43 30.57 

10  logwage  0 (0.00%)  range: 3.05.8  
11  wage  0 (0.00%)  range: 20.1318.3 
plot_frq
, plot_grpfrq
and
plot_grid
However, we often want to see an actual picture, for example, display
frequencies and percentages of categorical variables on a bar plot. For
that {sjPlot} package provides a convenient plotfrequencies
(plot_frq
) function, which does just that. For instance,
plotting education shows that around 9% of people in our data
did not finish a high school, while around 14% have a PhD.
Since {sjPlot} package works with tidyverse ðŸ¥³, we can easily group
the data by any other categorical variable, letâ€™s take race,
and get frequencies and percentages for every group.
plot_grid()
function puts several subplots in a single plot
and even names the subplots. For instance, a subplot C shows
that most of AfroAmericans in our dataset ARE highly educated. And of
coarse you can save this publicationready plot with â€¦ surprise
surpriiise â€¦ save_plot
command.
save_plot(filename = "race_vs_education.jpg", fig = p, width = 30, height = 19)
quartz_off_screen
2
While seeing counts and percentages of separate groups is cool, we
sometimes want to put groups directly near each other. And thatâ€™s
exactly what plotgroupedfrequencies (plot_grpfrq
)
function does. For instance, it clearly shows that most of the people
with lower education degrees work in factories, while
folks with higher education degrees work with
information.
plot_grpfrq(
var.cnt = Wage$education,
var.grp = Wage$jobclass)
This IS already useful, however, plot_xtab
function goes
one step further and displays percentages of jobclasses inside
of every educational degree as stackedbars,
where counts are identical to the previous plot, but every
educational category as one 100 percent. Such display only
reinforces our hypothesis that highly educated folks usually work in the
IT and shows a clear association between jobclass and
education. As if that were not enough, plot_xtab
even tests this hypothesis with the ChiSquared test of
association and displays a significant pvalue and a large
effect size.
# as stacked proportional bars
plot_xtab(
x = Wage$education,
grp = Wage$jobclass,
margin = "row",
bar.pos = "stack",
show.summary = TRUE,
coord.flip = TRUE)
So, plot_xtab
essentially visualizes cross
tables, also known as pivot tables. And if for
some reason you want an actual table with the results of a statistical
test, you can use tab_xtab
function instead.
tab_xtab(
var.row = Wage$education,
var.col = Wage$jobclass,
show.row.prc = T)
education  jobclass  Total  





190 70.9Â % 
78 29.1Â % 
268 100Â % 

636 65.5Â % 
335 34.5Â % 
971 100Â % 

342 52.6Â % 
308 47.4Â % 
650 100Â % 

274 40Â % 
411 60Â % 
685 100Â % 

102 23.9Â % 
324 76.1Â % 
426 100Â % 
Total 
1544 51.5Â % 
1456 48.5Â % 
3000 100Â % 
Ï‡^{2}=282.643 Â· df=4 Â· Cramerâ€™s V=0.307 Â· p=0.000 
(not part of the video) By the way, we can decide what kind of percentages are calculated, rows or columns or even single cells, whether we want to stack the bars and many more. It will automatically conduct Fisherâ€™s test of association if samples are small (<5).
The pvalues are based on chisq.test of x and y for each grp.
But enough about counting, since our dataset is about salaries, letâ€™s
figure our who earns more, industrial or information workers? Plot
frequencies function, which we used for counting, can also easily answer
this question if we give it a (1) wage
variable, (2) tell
it to plot a histogram
, (3) to show an average with
standard deviation and (4) to display a normal curve to see whether our
salaries are normally distributed. This visualization reveals that
industrial workers get 103 thousand dollars on average, while IT crowd
gets 17 thousand more.
The last thing Iâ€™d like to share with you before we visualize models,
is a visualization of likert scale data. If you have same scales or
categories across different variables, plot_likert
function
nicely compares percentages of scales or categories across those
variables.
data(pisaitems)
d < pisaitems %>%
dplyr::select(starts_with("ST25Q"))
view_df(d, show.frq = T, show.prc = T)
ID  Name  Label  Values  Value Labels  Freq.  % 

1  ST25Q01 
Never or almost never A few times a year About once a month Several times a month Several times a week 
6682 13143 13995 20353 11436 
10.18 20.03 21.33 31.02 17.43 

2  ST25Q02 
Never or almost never A few times a year About once a month Several times a month Several times a week 
24019 16789 10317 9489 4751 
36.75 25.68 15.78 14.52 7.27 

3  ST25Q03 
Never or almost never A few times a year About once a month Several times a month Several times a week 
11131 16164 12818 14569 10658 
17.04 24.74 19.62 22.30 16.31 

4  ST25Q04 
Never or almost never A few times a year About once a month Several times a month Several times a week 
20145 19961 12768 8797 3622 
30.85 30.57 19.55 13.47 5.55 

5  ST25Q05 
Never or almost never A few times a year About once a month Several times a month Several times a week 
12317 12141 10314 15645 15165 
18.78 18.51 15.73 23.86 23.12 
plot_likert(d)
Visualizing data is quite, but letâ€™s get to the really cool visualization stuff!
plot_model
function is the actual reason I love {sjPlot}
package. I literally use it everyday!
For example if I want to know how education influences salary, Iâ€™ll plot predictions from a simple linear model. Plotting prediction immediately tells me the story. Namely, people who did not even finish a high school, have the lowest salary compared to any other education level. Moreover, we can see that increasing education level means increasing salaries. So, education matters!
m < lm(wage ~ education, data = Wage)
plot_model(m, type = "pred")
$education
The only thing we canâ€™t see from this plot is whether this increase
is significant or not. We could use a well known summary
table for that, but the output, although useful, is not really pleasing
to the human eye and is not suitable for publication.
summary(m)
Call:
lm(formula = wage ~ education, data = Wage)
Residuals:
Min 1Q Median 3Q Max
112.31 19.94 3.09 15.33 222.56
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 84.104 2.231 37.695 < 2e16 ***
education2. HS Grad 11.679 2.520 4.634 3.74e06 ***
education3. Some College 23.651 2.652 8.920 < 2e16 ***
education4. College Grad 40.323 2.632 15.322 < 2e16 ***
education5. Advanced Degree 66.813 2.848 23.462 < 2e16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.53 on 2995 degrees of freedom
Multiple Rsquared: 0.2348, Adjusted Rsquared: 0.2338
Fstatistic: 229.8 on 4 and 2995 DF, pvalue: < 2.2e16
Luckily for us, plot_model()
with the argument
show.values = TRUE
transforms a boring summary table into
this informative picture, which shows the increase in salary in
thousands of dollars as compared to no education (Intercept,
not shown) with 95% confidence intervals and significance stars which
indicate that those increases in salary are significant.
(not in the video) Where vertical 0 indicates no effect (xaxis position 1 for most glmâ€™s and position 0 for most linear models)
plot_model(m, show.values = TRUE, width = 0.1)+
ylab("Increase in salary as compared to no education")
However, sometimes we still need to report the summary table, but we
need to make it look better. And thatâ€™s where tab_model
command comes into play. Within tab_model
we can show the
reference level, hide the intercept and change the style of
pvalues.
tab_model(m,
show.reflvl = T,
show.intercept = F,
p.style = "numeric_stars")
Â  wage  

Predictors  Estimates  CI  p 

Reference  

11.68 ^{***}  6.74Â â€“Â 16.62  <0.001 

23.65 ^{***}  18.45Â â€“Â 28.85  <0.001 

40.32 ^{***}  35.16Â â€“Â 45.48  <0.001 

66.81 ^{***}  61.23Â â€“Â 72.40  <0.001 
Observations  3000  
R^{2} / R^{2} adjusted  0.235 / 0.234  

But the most amazing thing about plot_model
and
tab_model
functions is that they work with almost any type
of model you can imagine! I successfully used it for mixedeffects
models, Bayesian models or negativebinomial models, to name a few. And
the authors of the package constantly improve the functionality, so
that, at the moment you read this blogpost, {sjPlot} package is most
likely improved.
Here is an example of how ease we can visualize a very fancy model, namely a generalized linear mixedeffects regression for negativebinomial distribution of age with 3 way interaction term and a random effect of education.
m.nb < glmer.nb(age ~ wage * jobclass * health + (1education), data = Wage)
plot_model(m.nb, type = "int")[[4]]
This interactions show that industrial workers with a very good health earn 50 thousand dollars already at the age of 31, while IT crowd gets the same salary ca. 8 years later, however at the age of 45 the IT crowd catches on and even starts to slowly overtake the factory workers, and finally, while IT folks get to the salary of 300 thousand dollars already at the age of 50, factory workers might reach this kind of wealth only at the end of their carrier, at around 63 years old. And the nonoverlapping confidence intervals indicate that such difference in salaries is significant.
Besides, you can easily change the appearance of your results by changing the default order of predictors and even choose particular values from the numeric predictor. For instance, letâ€™s take three salary values 50, 150 & 300 as we just talked about them and display our results in a different way.
plot_model(m.nb, type = "pred", terms = c("health", "jobclass", "wage [50, 150, 300]"))
Moreover, type
argument allows to create various plot
types. For example, we can easily visualize random effects if we want
to.
plot_model(m.nb,
type = "re",
width = .5,
show.values = T) + ylim(0.9,1.1)
It only gets better from now. If we want to explore several dependent
variables with the same predictors, we can use plot_models
function to plot several models at once. In the first code example weâ€™ll
use already familiar argument  show.values
 and a new one
 grid
 which plots models in separate fields to avoid
congestion and overload of information on the picture.
# fit two models
fit1 < lm(age ~ education + jobclass + health_ins, data = Wage)
fit2 < lm(wage ~ education + jobclass + health_ins, data = Wage)
# plot multiple models
plot_models(fit1, fit2, show.values = T, grid = TRUE)
In the second example we avoid clutter by simply not using
show.values
, since we can kind of read them from the
xaxes, and weâ€™ll use p.shape = TRUE
argument instead, in
order to display pvalues as shapes instead of significance stars.
plot_models(fit1, fit2, p.shape = TRUE)
tab_model
can also easily display multiple models. Here,
collapse.ci = TRUE
argument conveniently puts confidence
intervals below the estimates, so that we can report several models near
each other.
tab_model(fit1, fit2,
collapse.ci = TRUE,
p.style = "numeric_stars")
Â  age  wage  

Predictors  Estimates  p  Estimates  p 
(Intercept) 
43.15 ^{***} (41.68Â â€“Â 44.63) 
<0.001 
93.72 ^{***} (89.14Â â€“Â 98.31) 
<0.001 
education [2. HS Grad] 
0.21 ^{} (1.75Â â€“Â 1.34) 
0.794 
8.15 ^{***} (3.34Â â€“Â 12.95) 
0.001 
education [3. Some College] 
2.01 ^{*} (3.65Â â€“Â 0.37) 
0.016 
17.89 ^{***} (12.79Â â€“Â 22.99) 
<0.001 
education [4. College Grad] 
0.48 ^{} (2.13Â â€“Â 1.17) 
0.568 
33.03 ^{***} (27.91Â â€“Â 38.15) 
<0.001 
education [5. Advanced Degree] 
1.35 ^{} (0.45Â â€“Â 3.16) 
0.142 
57.90 ^{***} (52.28Â â€“Â 63.52) 
<0.001 
jobclass [2. Information] 
1.42 ^{**} (0.56Â â€“Â 2.28) 
0.001 
3.67 ^{**} (1.00Â â€“Â 6.35) 
0.007 
health ins [2. No] 
3.29 ^{***} (4.20Â â€“Â 2.39) 
<0.001 
19.89 ^{***} (22.72Â â€“Â 17.07) 
<0.001 
Observations  3000  3000  
R^{2} / R^{2} adjusted  0.033 / 0.031  0.284 / 0.283  

By the way, if you want to visualize and test ALL the assumptions of ANY model with a SINGLE function, check out this video about another amazing package created by the same author  Daniel LÃ¼decke.
If you think, I missed something, please comment on it, and Iâ€™ll improve this tutorial.
Thank you for learning!
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/yuryzablotski/yuzaRBlog, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zablotski (2022, Aug. 12). yuzaRBlog: R package reviews {sjPlot} How to Easily Visualize Data And Model Results. Retrieved from https://yuzarblog.netlify.app/posts/20220801sjplot/
BibTeX citation
@misc{zablotski2022r, author = {Zablotski, Yury}, title = {yuzaRBlog: R package reviews {sjPlot} How to Easily Visualize Data And Model Results}, url = {https://yuzarblog.netlify.app/posts/20220801sjplot/}, year = {2022} }