hypothesis: Non-linear hypothesis testing

Description.

  • Summary statistics of the posterior distributions related to the hypotheses.

Run the code above in your browser using DataCamp Workspace

An R companion to Statistics: data analysis and modelling

Chapter 12 bayesian estimation with brms, 12.1 the brms package.

brms ( Bürkner 2023 ) stands for Bayesian Regression Models using Stan. ( https://mc-stan.org/ )[Stan] is general software for Bayesian model specification and estimation, implementing Hamiltonian Monte Carlo as the main MCMC algorithm. Due to its efficiency and flexibility, it has quickly become one of the most popular software packages for Bayesian modelling. Stan uses its own language to define Bayesian models, which may pose a learning curve. The brms package provides an easy-to-use interface to Stan for estimating generalized linear mixed-effects models. As this covers a large part of the models used for data analysis in psychology,

12.1.1 Model definition

Defining models in brms is relatively straightforward, as the package relies on a similar formula interface as lm , glm , and lme4 . The main workhorse is the brms::brm() function. Main arguments for brms::brm() are:

  • formula : a model formula, using the lme4 syntax. Table 12.1 contains a reminder of this syntax.
  • data : the data.frame containing the variables used in the model formula.
  • family : similar to that used in glm() , a description of the response distribution (likelihood). By default, this is a Normal distribution.
  • prior : definition of the prior distributions. If not specified, default priors are used for all parameters.
  • sample_prior : logical, to indicate whether you also want to obtain draws from the prior distributions.
  • chains : number of parallel MCMC chains to run (defaults to 4).
  • iter : total number of MCMC iterations per chain (defaults to 2000).
  • warmup : number of samples to discard as a warmup or burn-in sample (defaults to iter/2 ).
  • thin : the so-called thinning rate, which can be used to only keep every \(k\) th sample.
  • seed : may be used to set the random seed for replication of results.
  • cores : specify the number of cores of the CPU to be used to run the chains in parallel. This defaults to 1 (each chain is run after the other). If you have a multicore CPU, it is better to set this higher (e.g. to 4) to speed-up computation time.

12.1.1.1 Brms family

The family argument in brms::brm() is used to define the random part of the model. The brms package extends the options of the family argument in the glm() function to allow for a much wider class of likelihoods. You can see the help file ( help("brmsfamily", package="brms") ) for a full list of the current options. Some examples of available options are:

  • Normal distribution, for linear regression and the GLM: gaussian(link = "identity")
  • Generalized Student \(t\) -distribution, useful for robust linear regression less influenced by outliers: student(link = "identity", link_sigma = "log", link_nu = "logm1")
  • Bernoulli distribution, for logistic regression: bernoulli(link = "logit")
  • Poisson distribution, useful for unbounded count data: poisson(link = "log")
  • Categorical distribution, for multinomial logistic regression: categorical(link = "logit", refcat = NULL)
  • Beta-distribution, for regression of a dependent variable bounded between 0 and 1 (e.g. proportions): Beta(link = "logit", link_phi = "log")
  • Ex-Gaussian distribution, popular for modelling reaction times: exgaussian(link = "identity", link_sigma = "log", link_beta = "log")

And there are many more options. There is even a so-called Wiener diffusion distribution implemented, which can be used to estimate the drift-diffusion model ( wiener(link = "identity", link_bs = "log", link_ndt = "log", link_bias = "logit") ).

Note that many of the brmsfamily objects have multiple link functions, one for each parameter of the distribution. For example, in student(link = "identity", link_sigma = "log", link_nu = "logm1") , there is one link function for the mean ( link ), one for the standard deviation ( link_sigma ), and one for the degrees of freedom ( link_nu ). This family implements a generalized \(t\) -distribution, which, in contrast to the standard \(t\) -distribution, has a location and scale parameter like the Normal distribution. But the degrees of freedom allow this distribution to have wider tails than the Normal distribution, which can make it more robust against outliers.

As an example of estimating a model with brms , we can estimate a linear regression model as follows:

Because estimation with brms involves random sampling via stan , we use the seed argument to be able to reproduce the results.

You can obtain a summary of the model via the summary() function:

The output first shows details of the model (e.g. family and model formula), and MCMC estimation (e.g. number of chains and iterations). Under Population-Level Effects: you will find a summary of the approximate posterior distribution for the main parameters of interest ( Intercept , and slopes for hate_groups_per_million and percent_bachelors_degree_or_higher ). The Estimate is the posterior mean, the Est.Error is the posterior standard deviation, the l-95% CI and u-95% CI provide respectively the lower and upper bound of the 95% Credible Interval (CI) computed with the Equal-Tailed Interval (ETI) method. The width of the latter interval can be changed by supplying a prob argument to the summary function. For example, summary(mod, prob=.9) would return the 90% CI.

The Rhat the \(\hat{R}\) values, and Bulk_ESS an estimate of the Effective Sample Size (ESS). The Tail_ESS provides somewhat of a lower bound on the ESS estimate (see Vehtari et al. 2019 ) . Under Family Specific Parameters: , you will find the same information for the non-regression parameters (i.e.  \(\sigma_\epsilon\) for a linear regression model).

You can obtain a quick plot showing a nonparametric density estimate for each parameter, and a plot of the sampled values per iteration and chain, via the plot() function:

characteristics of hypothesis in brm

If you want to create your own plots, there are several functions to extract the posterior samples of the parameters. Perhaps the most convenient is brms::as_draws_df() , which returns a data.frame with the posterior samples.

12.1.2 Priors

In the example above, we did not specify prior distributions. The brms::brm function uses default values for prior distributions if these are not explicitly set. You can view some detauls of these prior distributions via the prior_summary() function:

The output indicates that “flat” priors are used for the slopes of hate_groups_per_million and percent_bachelors_degree_or_higher . The “flat” priors are so-called improper priors , which indicate that any possible value is equally likely. These priors are therefore also uninformative priors . The prior for \(\sigma_\epsilon\) , the error SD ( sigma ), the prior is a (half) \(t\) -distribution with \(\text{df}=3\) , mean 0, and standard deviation 11.9. The SD is close to the SD of the dependent variable ( percent_Trump_votes ). This is reasonable in providing a weakly informative prior, as the error SD should be smaller than the SD of the dependent variable.

For the intercept, a \(t\) -distribution is used, with \(\text{df}=3\) , a mean of 49.3, and an SD of 11.9. The mean and SD of the prior are close to the mean and SD of percent_Trump_votes . So the prior for the intercept is based on characteristics of the dependent variable, which allows it to automatically have a reasonable scale. Note that the intercept in a brms::brm() model is computed separately, using centered predictors. It then corrects the sampled intercepts again after estimation, to provide an intercept for non-centered predictors. For more details, you can see the brms manual ( help("brmsformula", package="brms") ). If you need to define a prior on the scale of the usual intercept, this behaviour needs to be turned off first.

Default priors are generally fine. But if you want to define your own priors, this is possible with the prior argument. The prior argument expects a vector with prior specifications for each parameter for which you want to set a different prior than default. Each element in this vector needs to be an object of the brmsprior class. The easiest way to obtain such objects is by calling the brms::prior() function. This function has as important arguments: * prior : a specification of a distribution in the Stan language. There are many such distributions available (see the Stan Functions Reference for the full list). Common examples are normal(mean, sd) , student(df, mean, sd) . cauchy(mean, sd) , beta(alpha, beta) , where the values between the parentheses indicate so-called hyper-parameters (i.e. the parameters defining the distribution). I have used df , mean , and sd here, but the Stan manual uses nu , mu , and sigma for these. * class : the class of the parameter. This can for example be b for a regression coefficient (slope), sigma for an error SD, and Intercept for the intercept. This is what you will find under the class column in the output of brms::prior_summary() . Depending on the class, the prior may be restricted to respect the bounds for the parameter. For example, for a prior with class=sigma , the prior will always be truncated at 0, without you having to specify. * coef : the name of the coefficient in the parameter class. This is what you will find under the coef column in the output of brms::prior_summary() . * group : A grouping factor for group-level (random) effects.

For example, we can set Normal priors with a mean of 0 and a standard deviation depending on the parameters as follows. We use the c() function to create the vector with brmsprior objects. We need to define four such objects, each via a call to the prior() function. We assign the result to ab object called mod_priors , so that we can then use it in a call to brms::brm() :

We can now use our mod_priors object as the argument to prior , to estimate a regression model with non-default priors:

I have set sample_prior = TRUE so that we also get samples from the prior distributions. For example, we can get a histogram of the prior for \(\sigma_epsilon\) as:

characteristics of hypothesis in brm

12.2 Useful plots for MCMC samples

Besides the default plot() function, the brms::mcmc_plot() function can be used to obtain plots. Important arguments to this function are:

  • object : a fitted brms::brm() model
  • type : A specification of the type of plot. For example, hist (histogram), hist_by_chain (separate histograms for each MCMC chain), dens (nonparametric density plot), dens_overlay (overlaid density plots for each chain), violin (violin plots per chain), intervals (CI intervals for each parameter in the same plot), areas (densities for each parameter in the same plot, and trace (sampled parameters per iteration). For full details of all the options, see help("MCMC-overview", package="bayesplot") .
  • variable : Names of the variables (parameters) to plot, as e.g. a character vector. The names of the variables correspond to those provided as the column names in brms::as_draws_df(mod) .

For example, we can get histograms of the posterior for each parameter as

characteristics of hypothesis in brm

12.3 An example of a Poisson mixed-effects regression model

As an example of estimating a more complex model, we will use the brms package to estimate a Poisson mixed-effects model. We will use the gestures data of the sdamr package for this example. We first load the data, set contrasts, and define an offset variable:

We now define the model in brms , using a random effects specification to indicate we want random intercepts for each participant ( (1|ID) ), and an addition term offset(log_d) which sets the offset (unlike in glm and glmer , in brms the offset needs to be part of the model formula).

Note that I’m specifying to use cores=4 CPU cores here. This will speed up the computation, but requires you to have sufficient cores.

We have relied on the default priors, which are the usual flat priors for the regression coefficients, a \(t\) -distribution for the Intercept, and half- \(t\) distributions for the random effects and errors:

The summary of the approximate posterior is:

Whilst the \(\hat{R}\) values look OK, the effective sample size is rather low for the Intercept , language1 , gender , and language1:gender1 effects. We first inspect the autocorrelation for these parameters:

characteristics of hypothesis in brm

Thinning may improve the issues with autocorrelation. We rerun the model with different settings for the number of iterations and thinning via the brms::update() function, providing the estimated model as the first argument, and then specifying the number of iterations and thinning required. For example, we can rerun the model again using 1000 iterations as warmup, but now running for an additional 4000 iterations, leading to a total iter=5000 . We furthermore set the thinning to thin=4 to only keep each fourth iteration, which should reduce the autocorrelation:

Checking the model summary shows that this had the desired effect:

Understanding rumination as a form of inner speech

A an introduction to bayesian multilevel models using brms.

Bayesian multilevel models are increasingly used to overcome the limitations of frequentist approaches in the analysis of complex structured data. This paper introduces Bayesian multilevel modelling for the specific analysis of speech data, using the brms package developed in R . In this tutorial, we provide a practical introduction to Bayesian multilevel modelling, by reanalysing a phonetic dataset containing formant (F1 and F2) values for five vowels of Standard Indonesian (ISO 639-3:ind), as spoken by eight speakers (four females), with several repetitions of each vowel. We first give an introductory overview of the Bayesian framework and multilevel modelling. We then show how Bayesian multilevel models can be fitted using the probabilistic programming language Stan and the R package brms , which provides an intuitive formula syntax. Through this tutorial, we demonstrate some of the advantages of the Bayesian framework for statistical modelling and provide a detailed case study, with complete source code for full reproducibility of the analyses ( https://osf.io/dpzcb/ ). 64

A.1 Introduction

The last decade has witnessed noticeable changes in the way experimental data are analysed in phonetics, psycholinguistics, and speech sciences in general. In particular, there has been a shift from analysis of variance (ANOVA) to linear mixed models , also known as hierarchical models or multilevel models (MLMs), spurred by the spreading use of data-oriented programming languages such as R (R Core Team, 2018 ) , and by the enthusiasm of its active and ever growing community. This shift has been further sustained by the current transition in data analysis in social sciences, with researchers evolving from a widely criticised point-hypothesis mechanical testing (e.g., Bakan, 1966 ; Gigerenzer, 2004 ; Kline, 2004 ; Lambdin, 2012 ; Trafimow et al., 2018 ) to an approach that emphasises parameter estimation, model comparison, and continuous model expansion (e.g., Cumming, 2012 , 2014 ; Gelman et al., 2013 ; Gelman & Hill, 2006 ; Kruschke, 2015 ; Kruschke & Liddell, 2018 b , 2018 a ; McElreath, 2016 b ) .

MLMs offer great flexibility in the sense that they can model statistical phenomena that occur on different levels. This is done by fitting models that include both constant and varying effects (sometimes referred to as fixed and random effects). Among other advantages, this makes it possible to generalise the results to unobserved levels of the groups existing in the data (e.g., stimulus or participant, Janssen, 2012 ) . The multilevel strategy can be especially useful when dealing with repeated measurements (e.g., when measurements are nested into participants) or with unequal sample sizes, and more generally, when handling complex dependency structures in the data. Such complexities are frequently found in the kind of experimental designs used in speech science studies, for which MLMs are therefore particularly well suited.

The standard MLM is usually fitted in a frequentist framework, with the lme4 package (Bates, Maechler, Bolker, & Walker, 2018 ) in R (R Core Team, 2018 ) . However, when one tries to include the maximal varying effect structure, this kind of model tends either not to converge, or to give aberrant estimations of the correlation between varying effects (e.g., Bates, Kliegl, Vasishth, & Baayen, 2015 ) . 65 Yet, fitting the maximal varying effect structure has been explicitly recommended (e.g., Barr, Levy, Scheepers, & Tily, 2013 ) . In contrast, the maximal varying effect structure can generally be fitted in a Bayesian framework (Bates et al., 2015 ; Eager & Roy, 2017 ; Nicenboim & Vasishth, 2016 ; Sorensen et al., 2016 ) .

Another advantage of Bayesian statistical modelling is that it fits the way researchers intuitively understand statistical results. Widespread misinterpretations of frequentist statistics (like p-values and confidence intervals) are often attributable to the wrong interpretation of these statistics as resulting from a Bayesian analysis (e.g., Dienes, 2011 ; Gigerenzer, 2004 ; Hoekstra, Morey, Rouder, & Wagenmakers, 2014 ; Kruschke & Liddell, 2018 a ; Morey et al., 2015 ) . However, the intuitive nature of the Bayesian approach might arguably be hidden by the predominance of frequentist teaching in undergraduate statistical courses.

Moreover, the Bayesian approach offers a natural solution to the problem of multiple comparisons, when the situation is adequately modelled in a multilevel framework (Gelman, Hill, & Yajima, 2012 ; Scott & Berger, 2010 ) , and allows a priori knowledge to be incorporated in data analysis via the prior distribution. The latter feature is particularily relevant when dealing with contraint parameters or for the purpose of incorporating expert knowledge.

The aim of the current paper is to introduce Bayesian multilevel models, and to provide an accessible and illustrated hands-on tutorial for analysing typical phonetic data. This paper will be structured in two main parts. First, we will briefly introduce the Bayesian approach to data analysis and the multilevel modelling strategy. Second, we will illustrate how Bayesian MLMs can be implemented in R by using the brms package (Bürkner, 2018 b ) to reanalyse a dataset from McCloy ( 2014 ) available in the phonR package (McCloy, 2016 ) . We will fit Bayesian MLMs of increasing complexity, going step by step, providing explanatory figures and making use of the tools available in the brms package for model checking and model comparison. We will then compare the results obtained in a Bayesian framework using brms with the results obtained using frequentist MLMs fitted with lme4 . Throughout the paper, we will also provide comments and recommendations about the feasability and the relevance of such analysis for the researcher in speech sciences.

A.1.1 Bayesian data analysis

The Bayesian approach to data analysis differs from the frequentist one in that each parameter of the model is considered as a random variable (contrary to the frequentist approach which considers parameter values as unknown and fixed quantities), and by the explicit use of probability to model the uncertainty (Gelman et al., 2013 ) . The two approaches also differ in their conception of what probability is. In the Bayesian framework, probability refers to the experience of uncertainty, while in the frequentist framework it refers to the limit of a relative frequency (i.e., the relative frequency of an event when the number of trials approaches infinity). A direct consequence of these two differences is that Bayesian data analysis allows researchers to discuss the probability of a parameter (or a vector of parameters) \(\theta\) , given a set of data \(y\) :

\[p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}\]

Using this equation (known as Bayes’ theorem), a probability distribution \(p(\theta|y)\) can be derived (called the posterior distribution ), that reflects knowledge about the parameter, given the data and the prior information. This distribution is the goal of any Bayesian analysis and contains all the information needed for inference.

The term \(p(\theta)\) corresponds to the prior distribution , which specifies the prior information about the parameters (i.e., what is known about \(\theta\) before observing the data) as a probability distribution. The left hand of the numerator \(p(y|\theta)\) represents the likelihood , also called the sampling distribution or generative model , and is the function through which the data affect the posterior distribution. The likelihood function indicates how likely the data are to appear, for each possible value of \(\theta\) .

Finally, \(p(y)\) is called the marginal likelihood . It is meant to normalise the posterior distribution, that is, to scale it in the “probability world”. It gives the “probability of the data”, summing over all values of \(\theta\) and is described by \(p(y) = \sum_{\theta} p(\theta) p(y|\theta)\) for discrete parameters, and by \(p(y) = \int p(\theta) p(y|\theta) d\theta\) in the case of continuous parameters.

All this pieced together shows that the result of a Bayesian analysis, namely the posterior distribution \(p(\theta|y)\) , is given by the product of the information contained in the data (i.e., the likelihood) and the information available before observing the data (i.e., the prior). This constitutes the crucial principle of Bayesian inference, which can be seen as an updating mechanism (as detailed for instance in Kruschke & Liddell, 2018 b ) . To sum up, Bayes’ theorem allows a prior state of knowledge to be updated to a posterior state of knowledge, which represents a compromise between the prior knowledge and the empirical evidence.

The process of Bayesian analysis usually involves three steps that begin with setting up a probability model for all the entities at hand, then computing the posterior distribution, and finally evaluating the fit and the relevance of the model (Gelman et al., 2013 ) . In the context of linear regression, for instance, the first step would require to specify a likelihood function for the data and a prior distribution for each parameter of interest (e.g., the intercept or the slope). We will go through these three steps in more details in the application section, but we will first give a brief overview of the multilevel modelling strategy.

A.1.2 Multilevel modelling

MLMs can be considered as “multilevel” for at least two reasons. First, an MLM can generally be conceived as a regression model in which the parameters are themselves modelled as outcomes of another regression model. The parameters of this second-level regression are known as hyperparameters , and are also estimated from the data (Gelman & Hill, 2006 ) . Second, the multilevel structure can arise from the data itself, for instance when one tries to model the second-language speech intelligibility of a child, who is considered within a particular class, itself considered within a particular school. In such cases, the hierarchical structure of the data itself calls for hierarchical modelling. In both conceptions, the number of levels that can be handled by MLMs is virtually unlimited (McElreath, 2016 b ) . When we use the term multilevel in the following, we will refer to the structure of the model, rather than to the structure of the data, as non-nested data can also be modelled in a multilevel framework.

As briefly mentioned earlier, MLMs offer several advantages compared to single-level regression models, as they can handle the dependency between units of analysis from the same group (e.g., several observations from the same participant). In other words, they can account for the fact that, for instance, several observations are not independent, as they relate to the same participant. This is achieved by partitioning the total variance into variation due to the groups (level-2) and to the individual (level-1). As a result, such models provide an estimation of the variance component for the second level (i.e., the variability of the participant-specific estimates) or higher levels, which can inform us about the generalisability of the findings (Janssen, 2012 ; McElreath, 2016 b ) .

Multilevel modelling allows both fixed and random effects to be incorporated. However, as pointed out by Gelman ( 2005 ) , we can find at least five different (and sometimes contradictory) ways of defining the meaning of the terms fixed and random effects. Moreover, Gelman & Hill ( 2006 ) remarked that what is usually called a fixed effect can generally be conceived as a random effect with a null variance. In order to use a consistent vocabulary, we follow the recommendations of Gelman & Hill ( 2006 ) and avoid these terms. We instead use the more explicit terms constant and varying to designate effects that are constant, or that vary by groups. 66

A question one is frequently faced with in multilevel modelling is to know which parameters should be considered as varying, and which parameters should be considered as constant. A practical answer is provided by McElreath ( 2016 b ) , who states that “any batch of parameters with exchangeable index values can be and probably should be pooled”. For instance, if we are interested in the categorisation of native versus non-native phonemes and if for each phoneme in each category there are multiple audio stimuli (e.g., multiple repetitions of the same phoneme), and if we do not have any reason to think that, for each phoneme, audio stimuli may differ in intelligibility in any systematic way, then repetitions of the same phoneme should be pooled together. The essential feature of this strategy is that exchangeability of the lower units (i.e., the multiple repetitions of the same phoneme) is achieved by conditioning on indicator variables (i.e., the phonemes) that represent groupings in the population (Gelman et al., 2013 ) .

To sum up, multilevel models are useful as soon as there are predictors at different levels of variation (Gelman et al., 2013 ) . One important aspect is that this varying-coefficients approach allows each subgroup to have a different mean outcome level, while still estimating the global mean outcome level. In an MLM, these two estimations inform each other in a way that leads to the phenomenon of shrinkage , that will be discussed in more detail below (see section A.2.3 ).

As an illustration, we will build an MLM starting from the ordinary linear regression model, and trying to predict an outcome \(y_{i}\) (e.g., second-language (L2) speech-intelligibility) by a linear combination of an intercept \(\alpha\) and a slope \(\beta\) that quantifies the influence of a predictor \(x_{i}\) (e.g., the number of lessons received in this second language):

\[ \begin{aligned} y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha + \beta x_{i} \\ \end{aligned} \]

This notation is strictly equivalent to the (maybe more usual) following notation:

\[ \begin{aligned} y_{i} &= \alpha + \beta x_{i} + \epsilon_{i} \\ \epsilon_{i} &\sim \mathrm{Normal}(0,\sigma_e) \end{aligned} \]

We prefer to use the first notation as it generalises better to more complex models, as we will see later. In Bayesian terms, these two lines describe the likelihood of the model, which is the assumption made about the generative process from which the data is issued. We make the assumption that the outcomes \(y_{i}\) are normally distributed around a mean \(\mu_{i}\) with some error \(\sigma_{e}\) . This is equivalent to saying that the errors are normally distributed around \(0\) , as illustrated by the above equivalence. Then, we can extend this model to the following multilevel model, adding a varying intercept:

\[ \begin{aligned} y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha_{j[i]} + \beta x_{i} \\ \alpha_{j} &\sim \mathrm{Normal}(\alpha, \sigma_{\alpha}) \\ \end{aligned} \]

where we use the notation \(\alpha_{j[i]}\) to indicate that each group \(j\) (e.g., class) is given a unique intercept, issued from a Gaussian distribution centered on \(\alpha\) , the grand intercept, 67 meaning that there might be different mean scores for each class. From this notation we can see that in addition to the residual standard deviation \(\sigma_{e}\) , we are now estimating one more variance component \(\sigma_{\alpha}\) , which is the standard deviation of the distribution of varying intercepts. We can interpret the variation of the parameter \(\alpha\) between groups \(j\) by considering the intra-class correlation (ICC) \(\sigma_{\alpha}^{2} / (\sigma_{\alpha}^{2} + \sigma_{e}^{2})\) , which goes to \(0\) , if the grouping conveys no information, and to \(1\) , if all observations in a group are identical (Gelman & Hill, 2006 , p. 258) .

The third line is called a prior distribution in the Bayesian framework. This prior distribution describes the population of intercepts, thus modelling the dependency between these parameters.

Following the same strategy, we can add a varying slope, allowed to vary according to the group \(j\) :

\[ \begin{aligned} y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha_{j[i]} + \beta_{j[i]} x_{i} \\ \alpha_{j} &\sim \mathrm{Normal}(\alpha, \sigma_{\alpha}) \\ \beta_{j} &\sim \mathrm{Normal}(\beta, \sigma_{\beta}) \\ \end{aligned} \]

Indicating that the effect of the number of lessons on L2 speech intelligibility is allowed to differ from one class to another (i.e., the effect of the number of lessons might be more beneficial to some classes than others). These varying slopes are assigned a prior distribution centered on the grand slope \(\beta\) , and with standard deviation \(\sigma_{\beta}\) .

In this introductory section, we have presented the foundations of Bayesian analysis and multilevel modelling. Bayes’ theorem allows prior knowledge about parameters to be updated according to the information conveyed by the data, while MLMs allow complex dependency structures to be modelled. We now move to a detailed case study in order to illustrate these concepts.

A.1.3 Software programs

Sorensen et al. ( 2016 ) provided a detailed and accessible introduction to Bayesian MLMs (BMLMs) applied to linguistics, using the probabilistic language Stan (Carpenter et al., 2017 ) . However, discovering BMLMs and the Stan language all at once might seem a little overwhelming, as Stan can be difficult to learn for users that are not experienced with programming languages. As an alternative, we introduce the brms package (Bürkner, 2018 b ) , that implements BMLMs in R , using Stan under the hood, with an lme4 -like syntax. Hence, the syntax required by brms will not surprise the researcher familiar with lme4 , as models of the following form:

\[ \begin{aligned} y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha + \alpha_{subject[i]} + \beta x_{i} \\ \end{aligned} \]

are specified in brms (as in lme4 ) with: y ~ 1 + x + (1|subject) . In addition to linear regression models, brms allows generalised linear and non-linear multilevel models to be fitted, and comes with a great variety of distribution and link functions. For instance, brms allows fitting robust linear regression models, or modelling dichotomous and categorical outcomes using logistic and ordinal regression models. The flexibility of brms also allows for distributional models (i.e., models that include simultaneous predictions of all response parameters), Gaussian processes or non-linear models to be fitted, among others. More information about the diversity of models that can be fitted with brms and their implementation is provided in Bürkner ( 2018 b ) and Bürkner ( 2018 a ) .

A.2 Application example

To illustrate the use of BMLMs, we reanalysed a dataset from McCloy ( 2014 ) , available in the phonR package (McCloy, 2016 ) . This dataset contains formant (F1 and F2) values for five vowels of Standard Indonesian (ISO 639-3:ind), as spoken by eight speakers (four females), with approximately 45 repetitions of each vowel. The research question we investigated here is the effect of gender on vowel production variability.

A.2.1 Data pre-processing

Our research question was about the different amount of variability in the respective vowel productions of male and female speakers, due to cognitive or social differences. To answer this question, we first needed to get rid of the differences in vowel production that are due to physiological differences between males and females (e.g., shorter vocal tract length for females). More generally, we needed to eliminate the inter-individual differences due to physiological characteristics in our groups of participants. For that purpose, we first applied the Watt & Fabricius formant normalisation technique (Watt & Fabricius, 2002 ) . The principle of this method is to calculate for each speaker a “centre of gravity” \(S\) in the F1/F2 plane, from the formant values of point vowels [i, a , u], and to express the formant values of each observation as ratios of the value of \(S\) for that formant.

Euclidean distances between each observation and the centres of gravity corresponding to each vowel across all participants, by gender (top row: female, bottom row: male) and by vowel (in column), in the normalised F1-F2 plane. The grey background plots represent the individual data collapsed for all individuals (male and female) and all vowels. Note that, for the sake of clarity, this figure represents a unique center of gravity for each vowel for all participants, whereas in the analysis, one center of gravity was used for each vowel and each participant.

Figure A.1: Euclidean distances between each observation and the centres of gravity corresponding to each vowel across all participants, by gender (top row: female, bottom row: male) and by vowel (in column), in the normalised F1-F2 plane. The grey background plots represent the individual data collapsed for all individuals (male and female) and all vowels. Note that, for the sake of clarity, this figure represents a unique center of gravity for each vowel for all participants, whereas in the analysis, one center of gravity was used for each vowel and each participant.

Then, for each vowel and participant, we computed the Euclidean distance between each observation and the centre of gravity of the whole set of observations in the F1-F2 plane for that participant and that vowel. The data obtained by this process are illustrated in Figure A.1 , and a sample of the final dataset can be found in Table A.1 .

A.2.2 Constant effect of gender on vowel production variability

We then built a first model with constant effects only and vague priors on \(\alpha\) and \(\beta\) , the intercept and the slope. We contrast-coded gender (f = -0.5, m = 0.5). Our dependent variable was therefore the distance from each individual vowel centre of gravity, which we will refer to as formant distance in the following. The formal model can be expressed as:

\[ \begin{aligned} \text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha + \beta \times \text{gender}_{i} \\ \alpha &\sim \mathrm{Normal}(0, 10) \\ \beta &\sim \mathrm{Normal}(0, 10) \\ \sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\ \end{aligned} \]

where the first two lines of the model describe the likelihood and the linear model. 68 The next three lines define the prior distribution for each parameter of the model, where \(\alpha\) and \(\beta\) are given a vague (weakly informative) Gaussian prior centered on \(0\) , and the residual variation is given a Half-Cauchy prior (Gelman, 2006 ; Polson & Scott, 2012 ) , thus restricting the range of possible values to positive ones. As depicted in Figure A.2 , the \(\mathrm{Normal}(0,10)\) prior is weakly informative in the sense that it grants a relative high weight to \(\alpha\) and \(\beta\) values, between -25 and 25. This corresponds to very large (given the scale of our data) values for, respectively, the mean distance value \(\alpha\) , and the mean difference between males and females \(\beta\) . The \(\mathrm{HalfCauchy}(10)\) prior placed on \(\sigma_{e}\) also allows very large values of \(\sigma_{e}\) , as represented in the right panel of Figure A.2 .

Prior distributions used in the first model, for $\alpha$ and $\beta$ (left panel) and for the residual variation $\sigma_{e}$ (right panel).

Figure A.2: Prior distributions used in the first model, for \(\alpha\) and \(\beta\) (left panel) and for the residual variation \(\sigma_{e}\) (right panel).

These priors can be specified in numerous ways (see ?set_prior for more details), among which the following:

where a prior can be defined over a class of parameters (e.g., for all variance components, using the sd class) or for a specific one, for instance as above by specifying the coefficient ( coef ) to which the prior corresponds (here the slope of the constant effect of gender).

The model can be fitted with brms with the following command:

where distance is the distance from the centre of gravity. The iter argument serves to specify the total number of iterations of the Markov Chain Monte Carlo (MCMC) algorithm, and the warmup argument specifies the number of iterations that are run at the beginning of the process to “calibrate” the MCMC, so that only iter - warmup iterations are retained in the end to approximate the shape of the posterior distribution (for more details, see McElreath, 2016 b ) .

Figure A.3 depicts the estimations of this first model for the intercept \(\alpha\) , the slope \(\beta\) , and the residual standard deviation \(\sigma_{e}\) . The left part of the plot shows histograms of draws taken from the posterior distribution, and from which several summaries can be computed (e.g., mean, mode, quantiles). The right part of Figure A.3 shows the behaviour of the two simulations (i.e., the two chains) used to approximate the posterior distribution, where the x-axis represents the number of iterations and the y-axis the value of the parameter. This plot reveals one important aspect of the simulations that should be checked, known as mixing . A chain is considered well mixed if it explores many different values for the target parameters and does not stay in the same region of the parameter space. This feature can be evaluated by checking that these plots, usually referred to as trace plots , show random scatter around a mean value (they look like a “fat hairy caterpillar”).

Histograms of posterior samples and trace plots of the intercept, the slope for gender and the standard deviation of the residuals of the constant effects model.

Figure A.3: Histograms of posterior samples and trace plots of the intercept, the slope for gender and the standard deviation of the residuals of the constant effects model.

The estimations obtained for this first model are summarised in Table A.2 , which includes the mean, the standard error (SE), and the lower and upper bounds of the 95% credible interval (CrI) 69 of the posterior distribution for each parameter. As gender was contrast-coded before the analysis (f = -0.5, m = 0.5), the intercept \(\alpha\) corresponds to the grand mean of the formant distance over all participants and has its mean around 0.163. The estimate of the slope ( \(\beta =\) -0.042) suggests that females are more variable than males in the way they pronounce vowels, while the 95% CrI can be interpreted in a way that there is a \(0.95\) probability that the value of the intercept lies in the [-0.052, -0.033] interval.

The Rhat value corresponds to the potential scale reduction factor \(\hat{R}\) (Gelman & Rubin, 1992 ) , that provides information about the convergence of the algorithm. This index can be conceived as equivalent to the F-ratio in ANOVA. It compares the between-chains variability (i.e., the extent to which different chains differ one from each other) to the within-chain variability (i.e., how widely a chain explores the parameter space), and, as such, gives an index of the convergence of the chains. An overly large between-chains variance (as compared to the within-chain variability) would be a sign that chain-specific characteristics, like the starting value of the algorithm, have a strong influence on the final result. Ideally, the value of Rhat should be close to 1, and should not exceed 1.1. Otherwise, one might consider running more iterations or defining stronger priors (Bürkner, 2018 b ; Gelman et al., 2013 ) .

A.2.3 Varying intercept model

The first model can be improved by taking into account the dependency between vowel formant measures for each participant. This is handled in MLMs by specifying unique intercepts \(\alpha_{subject[i]}\) and by assigning them a common prior distribution. This strategy corresponds to the following by-subject varying-intercept model, bmod2 :

\[ \begin{aligned} \text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha + \alpha_{subject[i]} + \beta \times \text{gender}_{i} \\ \alpha_{subject} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\ \alpha &\sim \mathrm{Normal}(0, 10) \\ \beta &\sim \mathrm{Normal}(0, 10) \\ \sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\ \end{aligned} \]

This model can be fitted with brms with the following command (where we specify the HalfCauchy prior on \(\sigma_{subject}\) by applying it on parameters of class sd ):

As described in the first part of the present paper, we now have two sources of variation in the model: the standard deviation of the residuals \(\sigma_{e}\) and the standard deviation of the by-subject varying intercepts \(\sigma_{subject}\) . The latter represents the standard deviation of the population of varying intercepts, and is also learned from the data. It means that the estimation of each unique intercept will inform the estimation of the population of intercepts, which, in return, will inform the estimation of the other intercepts. We call this sharing of information between groups the partial pooling strategy, in comparison with the no pooling strategy, where each intercept is estimated independently, and with the complete pooling strategy, in which all intercepts are given the same value (Gelman et al., 2013 ; Gelman & Hill, 2006 ; McElreath, 2016 b ) . This is one of the most essential features of MLMs, and what leads to better estimations than single-level regression models for repeated measurements or unbalanced sample sizes. This pooling of information is made apparent through the phenomenon of shrinkage , which is illustrated in Figure A.4 , and later on, in Figure A.6 .

Figure A.4 shows the posterior distribution as estimated by this second model for each participant, in relation to the raw mean of its category (i.e., females or males), represented by the vertical dashed lines. We can see for instance that participants M02 and F09 have smaller average distance than the means of their groups, while participants M03 and F08 have larger ones. The arrows represent the amount of shrinkage , that is, the deviation between the mean in the raw data (represented by a cross underneath each density) and the estimated mean of the posterior distribution (represented by the peak of the arrow). As shown in Figure A.4 , this shrinkage is always directed toward the mean of the considered group (i.e., females or males) and the amount of shrinkage is determined by the deviation of the individual mean from its group mean. This mechanism acts like a safeguard against overfitting, preventing the model from overly trusting each individual datum.

Posterior distributions by subject, as estimated by the `bmod2` model. The vertical dashed lines represent the means of the formant distances for the female and male groups. Crosses represent the mean of the raw data, for each participant. Arrows represent the amount of shrinkage, between the raw mean and the estimation of the model (the mean of the posterior distribution).

Figure A.4: Posterior distributions by subject, as estimated by the bmod2 model. The vertical dashed lines represent the means of the formant distances for the female and male groups. Crosses represent the mean of the raw data, for each participant. Arrows represent the amount of shrinkage, between the raw mean and the estimation of the model (the mean of the posterior distribution).

The marginal posterior distribution of each parameter obtained with bmod2 is summarised in Table A.3 , where the Rhat values close to \(1\) suggest that the model has converged. We see that the estimates of \(\alpha\) and \(\beta\) are similar to the estimates of the first model, except that the SE is now slightly larger. This result might seem surprising at first sight, as we expected to improve the first model by adding a by-subject varying intercept. In fact, it reveals an underestimation of the SE when using the first model. Indeed, the first model assumes independence of observations, which is violated in our case. This highlights the general need for careful consideration of the model’s assumptions when interpreting its estimations. The first model seemingly gives highly certain estimates, but these estimations are only valid in the “independence of observations” world (see also the distinction between the small world and the large world in McElreath, 2016 b ) . Moreover, estimating an intercept by subject (as in the second model) increases the precision of estimation, but it also makes the average estimation less certain, thus resulting in a larger SE.

This model ( bmod2 ), however, is still not adequate to describe the data, as the dependency between repetitions of each vowel is not taken into account. In bmod3 , we added a by-vowel varying intercept, thus also allowing each vowel to have a different general level of variability.

\[ \begin{aligned} \text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha + \alpha_{subject[i]} + \alpha_{vowel[i]} + \beta \times \text{gender}_{i} \\ \alpha_{subj} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\ \alpha_{vowel} &\sim \mathrm{Normal}(0, \sigma_{vowel}) \\ \alpha &\sim \mathrm{Normal}(0, 10) \\ \beta &\sim \mathrm{Normal}(0, 10) \\ \sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{vowel} &\sim \mathrm{HalfCauchy}(10) \\ \end{aligned} \]

This model can be fitted with brms with the following command:

where the same Half-Cauchy is specified for the two varying intercepts, by applying it directly to the sd class.

The marginal posterior distribution of each parameter is summarised in Table A.4 . We can compute the intra-class correlation (ICC, see section A.1.2 ) to estimate the relative variability associated with each varying effect: \(ICC_{subject}\) is equal to 0.034 and \(ICC_{vowel}\) is equal to 0.429. The rather high ICC for vowels suggests that observations are highly correlated within each vowel, thus stressing the relevance of allocating a unique intercept by vowel. 70

A.2.4 Including a correlation between varying intercept and varying slope

One can legitimately question the assumption that the differences between male and female productions are identical for each vowel. To explore this issue, we thus added a varying slope for the effect of gender, allowing it to vary by vowel. Moreover, we can exploit the correlation between the baseline level of variability by vowel, and the amplitude of the difference between males and females in pronouncing them. For instance, we can observe that the pronunciation of /a/ is more variable in general. We might want to know whether females tend to pronounce vowels that are situated at a specific location in the F1-F2 plane with less variability than males. In other words, we might be interested in knowing whether the effect of gender is correlated with the baseline level of variability. This is equivalent to investigating the dependency , or the correlation between the varying intercepts and the varying slopes. We thus estimated this correlation by modelling \(\alpha_{vowel}\) and \(\beta_{vowel}\) as issued from the same multivariate normal distribution (a multivariate normal distribution is a generalisation of the usual normal distribution to more than one dimension), centered on \(0\) and with some covariance matrix \(\textbf{S}\) , as specified on the third line of the following model:

\[ \begin{aligned} \text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha + \alpha_{subject[i]} + \alpha_{vowel[i]} + (\beta + \beta_{vowel[i]}) \times \text{gender}_{i} \\ \begin{bmatrix} \alpha_{\text{vowel}} \\ \beta_{\text{vowel}} \\ \end{bmatrix} &\sim \mathrm{MVNormal}\bigg(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \textbf{S}\bigg) \\ \textbf{S} &= \begin{pmatrix} \sigma_{\alpha_{vowel}}^{2} & \sigma_{\alpha_{vowel}}\sigma_{\beta{vowel}} \rho \\ \sigma_{\alpha_{vowel}}\sigma_{\beta{vowel}} \rho & \sigma_{\beta_{vowel}}^{2} \\ \end{pmatrix} \\ \alpha_{subject} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\ \alpha &\sim \mathrm{Normal}(0, 10) \\ \beta &\sim \mathrm{Normal}(0, 10) \\ \sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{\alpha_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{\beta_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\ \textbf{R} &\sim \mathrm{LKJcorr}(2) \\ \end{aligned} \]

where \(\textbf{R}\) is the correlation matrix \(\textbf{R} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\) and \(\rho\) is the correlation between intercepts and slopes, used in the computation of \(\textbf{S}\) . This matrix is given the LKJ-Correlation prior (Lewandowski, Kurowicka, & Joe, 2009 ) with a parameter \(\zeta\) (zeta) that controls the strength of the correlation. 71 When \(\zeta = 1\) , the prior distribution on the correlation is uniform between \(-1\) and \(1\) . When \(\zeta > 1\) , the prior distribution is peaked around a zero correlation, while lower values of \(\zeta\) ( \(0 < \zeta < 1\) ) allocate more weight to extreme values (i.e., close to -1 and 1) of \(\rho\) (see Figure A.5 ).

Visualisation of the LKJ prior for different values of the shape parameter $\zeta$.

Figure A.5: Visualisation of the LKJ prior for different values of the shape parameter \(\zeta\) .

Estimates of this model are summarised in Table A.5 . This summary reveals a negative correlation between the intercepts and slopes for vowels, meaning that vowels with a large “baseline level of variability” (i.e., with a large average distance value) tend to be pronounced with more variability by females than by males. However, we notice that this model’s estimation of \(\beta\) is even more uncertain than that of the previous models, as shown by the associated standard error and the width of the credible interval.

Figure A.6 illustrates the negative correlation between the by-vowel intercepts and the by-vowel slopes, meaning that vowels that tend to have higher “baseline variability” (i.e., /e/, /o/, /a/), tend to show a stronger effect of gender . This figure also illustrates the amount of shrinkage, here in the parameter space. We can see that the partial pooling estimate is shrunk somewhere between the no pooling estimate and the complete pooling estimate (i.e., the grand mean). This illustrates again the mechanism by which MLMs balance the risk of overfitting and underfitting (McElreath, 2016 b ) .

Shrinkage of estimates in the parameter space, due to the pooling of information between clusters (based on the `bmod4` model). The ellipses represent the contours of the bivariate distribution, at different degrees of confidence 0.1, 0.3, 0.5 and 0.7.

Figure A.6: Shrinkage of estimates in the parameter space, due to the pooling of information between clusters (based on the bmod4 model). The ellipses represent the contours of the bivariate distribution, at different degrees of confidence 0.1, 0.3, 0.5 and 0.7.

A.2.5 Varying intercept and varying slope model, interaction between subject and vowel

So far, we modelled varying effects of subjects and vowels. In this study, these varying factors were crossed, meaning that every subject had to pronounce every vowel. Let us now imagine a situation in which Subject 4 systematically mispronounced the /i/ vowel. This would be a source of systematic variation over replicates which is not considered in the model ( bmod4 ), because this model can only adjust parameters for either vowel or participant, but not for a specific vowel for a specific participant.

In building the next model, we added a varying intercept for the interaction between subject and vowel, that is, we created an index variable that allocates a unique value at each crossing of the two variables (e.g., Subject1-vowel/a/, Subject1-vowel/i/, etc.), resulting in \(8 \times 5 = 40\) intercepts to be estimated (for a review of multilevel modeling in various experimental designs, see Judd, Westfall, & Kenny, 2017 ) . This varying intercept for the interaction between subject and vowel represents the systematic variation associated with a specific subject pronouncing a specific vowel. This model can be written as follows, for any observation \(i\) :

\[ \begin{aligned} \text{distance}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma_{e}) \\ \mu_{i} &= \alpha + \alpha_{subject[i]} + \alpha_{vowel[i]} + \alpha_{subject:vowel[i]} + (\beta + \beta_{vowel[i]}) \times \text{gender}_{i} \\ \begin{bmatrix} \alpha_{\text{vowel}} \\ \beta_{\text{vowel}} \\ \end{bmatrix} &\sim \mathrm{MVNormal}\bigg(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \textbf{S}\bigg) \\ \textbf{S} &= \begin{pmatrix} \sigma_{\alpha_{vowel}}^{2} & \sigma_{\alpha_{vowel}}\sigma_{\beta{vowel}} \rho \\ \sigma_{\alpha_{vowel}}\sigma_{\beta{vowel}} \rho & \sigma_{\beta_{vowel}}^{2} \\ \end{pmatrix} \\ \alpha_{subject} &\sim \mathrm{Normal}(0, \sigma_{subject}) \\ \alpha_{subject:vowel} &\sim \mathrm{Normal}(0, \sigma_{subject:vowel}) \\ \alpha &\sim \mathrm{Normal}(0, 10) \\ \beta &\sim \mathrm{Normal}(0, 10) \\ \sigma_{e} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{subject} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{subject:vowel} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{\alpha_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\ \sigma_{\beta_{vowel}} &\sim \mathrm{HalfCauchy}(10) \\ \textbf{R} &\sim \mathrm{LKJcorr}(2) \\ \end{aligned} \]

This model can be fitted with the following command:

Estimates of this model are summarised in Table A.6 . From this table, we first notice that the more varying effects we add, the more the model is uncertain about the estimation of \(\alpha\) and \(\beta\) , which can be explained in the same way as in section 2.2. Second, we see the opposite pattern for \(\sigma_{e}\) , the residuals standard deviation, which has decreased by a considerable amount compared to the first model, indicating a better fit.

A.3 Model comparison

Once we have built a set of models, we need to know which model is the more accurate and should be used to draw conclusions. It might be a little tricky to select the model that has the better absolute fit on the actual data (using for instance \(R^{2}\) ), as this model will not necessarily perform as well on new data. Instead, we might want to choose the model that has the best predictive abilities, that is, the model that performs the best when it comes to predicting data that have not yet been observed. We call this ability the out-of-sample predictive performance of the model (McElreath, 2016 b ) . When additional data is not available, cross-validation techniques can be used to obtain an approximation of the model’s predictive abilities, among which the Bayesian leave-one-out-cross-validation (LOO-CV, Vehtari, Gelman, & Gabry, 2017 ) . Another useful tool, and asymptotically equivalent to the LOO-CV, is the Watanabe Akaike Information Criterion (WAIC, Watanabe, 2010 ) , which can be conceived as a generalisation of the Akaike Information Criterion (AIC, Akaike, 1974 ) . 72

Both WAIC and LOO-CV indexes are easily computed in brms with the WAIC and the LOO functions, where \(n\) models can be compared with the following call: LOO(model1, model2, ..., modeln) . These functions also provide an estimate of the uncertainty associated with these indexes (in the form of a SE), as well as a difference score \(\Delta \text{LOOIC}\) , which is computed by taking the difference between each pair of information criteria. A comparison of the five models we fitted can be found in Table A.7 .

We see from Table A.7 that bmod5 (i.e., the last model) is performing much better than the other models, as it has the lower LOOIC. We then based our conclusions (see last section) on the estimations of this model. We also notice that each addition to the initial model brought improvement in terms of predictive accuracy, as the set of models is ordered from the first to the last model. This should not be taken as a general rule though, as successive additions made to an original model could also lead to overfitting , corresponding to a situation in which the model is over-specified in regards to the data, which makes the model good to explain the data at hand, but very bad to predict non-observed data. In such cases, information criteria and indexes that rely exclusively on goodness-of-fit (such as \(R^{2}\) ) would point to different conclusions.

A.4 Comparison of brms and lme4 estimations

Figure A.7 illustrates the comparison of brms (Bayesian approach) and lme4 (frequentist approach) estimates for the last model ( bmod5 ), fitted in lme4 with the following command.

Densities represent the posterior distribution as estimated by brms along with 95% credible intervals, while the crosses underneath represent the maximum likelihood estimate (MLE) from lme4 along with 95% confidence intervals, obtained with parametric bootstrapping.

Comparison of estimations from `brms` and `lme4`. Dots represent means of posterior distribution along with 95\% CrIs, as estimated by the `bmod5` model. Crosses represent estimations of `lme4` along with bootstrapped 95\% CIs.

Figure A.7: Comparison of estimations from brms and lme4 . Dots represent means of posterior distribution along with 95% CrIs, as estimated by the bmod5 model. Crosses represent estimations of lme4 along with bootstrapped 95% CIs.

We can see that the estimations of brms and lme4 are for the most part very similar. The differences we observe for \(\sigma_{\alpha_{vowel}}\) and \(\sigma_{\beta_{vowel}}\) might be explained by the skewness of the posterior distribution. Indeed, in these cases (i.e., when the distribution is not symmetric), the mode of the distribution would better coincide with the lme4 estimate. This figure also illustrates a limitation of frequentist MLMs that we discussed in the first part of the current paper. If we look closely at the estimates of lme4 , we can notice that the MLE for the correlation \(\rho\) is at its boundary, as \(\rho = -1\) . This might be interpreted in (at least) two ways. The first interpretation is what Eager & Roy ( 2017 ) call the parsimonious convergence hypothesis (PCH) and consists in saying that this aberrant estimation is caused by the over-specification of the random structure (e.g., Bates et al., 2015 ) . In other words, this would correspond to a model that contains too many varying effects to be “supported” by a certain dataset (but this does not mean that with more data, this model would not be a correct model). However, the PCH has been questioned by Eager & Roy ( 2017 ) , who have shown that under conditions of unbalanced datasets, non-linear models fitted with lme4 provided more prediction errors than Bayesian models fitted with Stan . The second interpretation considers failures of convergence as a problem of frequentist MLMs per se , which is resolved in the Bayesian framework by using weakly informative priors (i.e., the LKJ prior) for the correlation between varying effects (e.g., Eager & Roy, 2017 ; Nicenboim & Vasishth, 2016 ) , and by using the full posterior for inference.

One feature of the Bayesian MLM in this kind of situation is to provide an estimate of the correlation that incorporates the uncertainty caused by the weak amount of data (i.e., by widening the posterior distribution). Thus, the brms estimate of the correlation coefficient has its posterior mean at \(\rho = -0.433\) , but this estimate comes with a huge uncertainty, as expressed by the width of the credible interval ( \(95\% \ \text{CrI}\ [-0.946,0.454]\) ).

A.5 Inference and conclusions

Regarding our initial question, which was to know whether there is a gender effect on vowel production variability in standard Indonesian, we can base our conclusions on several parameters and indices. However, the discrepancies between the different models we fitted deserve some discussion first. As already pointed out previously, if we had based our conclusions on the results of the first model (i.e., the model with constant effects only), we would have confidently concluded on a positive effect of gender. However, when we included the appropriate error terms in the model to account for repeated measurements by subject and by vowel, as well as for the by-vowel specific effect of gender, the large variability of this effect among vowels led the model to adjust its estimation of \(\beta\) , resulting in more uncertainty about it. The last model then estimated a value of \(\beta =\) -0.042 with quite a large uncertainty ( \(95 \% \ \text{CrI} =\) [-0.102, 0.018]), and considering \(0\) as well as some positive values as credible. This result alone makes it difficult to reach any definitive conclusion concerning the presence or absence of a gender effect on the variability of vowels pronunciation in Indonesian, and should be considered (at best) as suggestive.

Nevertheless, it is useful to recall that in the Bayesian framework, the results of our analysis is a (posterior) probability distribution which can be, as such, summarised in multiple ways. This distribution is plotted in Figure A.8 , which also shows the mean and the 95% CrI, as well as the proportion of the distribution below and above a particular value. 73 This figure reveals that around \(94\%\) of the distribution is below \(0\) , which can be interpreted as suggesting that there is a \(0.94\) probability that males have a lower mean formant distance than females (recall that female was coded as -0.5 and male as 0.5), given the data at hand, and the model.

Histogram of posterior samples of the slope for gender, as estimated by the last model.

Figure A.8: Histogram of posterior samples of the slope for gender, as estimated by the last model.

This quantity can be easily computed from the posterior samples:

Of course, this estimate can (and should) be refined using more data from several experiments, with more speakers. In this line, it should be pointed out that brms can easily be used to extend the multilevel strategy to meta-analyses (e.g., Bürkner, Williams, Simmons, & Woolley, 2017 ; Williams & Bürkner, 2017 ) . Its flexibility makes it possible to fit multilevel hierarchical Bayesian models at two, three, or more levels, enabling researchers to model the heterogeneity between studies as well as dependencies between experiments of the same study, or between studies carried out by the same research team. Such a modelling strategy is usually equivalent to the ordinary frequentist random-effect meta-analysis models, while offering all the benefits inherent to the Bayesian approach.

Another useful source of information comes from the examination of effects sizes. One of the most used criteria is Cohen’s \(d\) standardized effect size, that expresses the difference between two groups in terms of their pooled standard deviation:

\[ \text{Cohen's d} = \frac{\mu_{1}-\mu_{2}}{\sigma_{pooled}}=\frac{\mu_{1}-\mu_{2}}{\sqrt{\frac{\sigma_{1}^{2}+\sigma_{2}^{2}}{2}}} \]

However, as the total variance is partitioned into multiple sources of variation in MLMs, there is no unique way of computing a standardised effect size. While several approaches have been suggested (e.g., dividing the mean difference by the standard deviation of the residuals), the more consensual one involves taking into account all of the variance sources of the model (Hedges, 2007 ) . One such index is called the \(\delta_{t}\) (where the \(t\) stands for “total”), and is given by the estimated difference between group means, divided by the square root of the sum of all variance components:

\[ \delta_{t} = \frac{\beta}{\sqrt{\sigma_{subject}^{2} + \sigma_{subject:vowel}^{2} + \sigma_{\alpha_{vowel}}^{2} + \sigma_{\beta_{vowel}}^{2} + \sigma^{2}}} \]

As this effect size is dependent on the parameters estimated by the model, one can derive a probability distribution for this index as well. This is easily done in R , computing it from the posterior samples:

This distribution is plotted in Figure A.9 , and reveals the large uncertainty associated with the estimation of \(\delta_{t}\) .

Posterior distribution of $\delta_{t}$.

Figure A.9: Posterior distribution of \(\delta_{t}\) .

In the same fashion, undirected effect sizes (e.g., \(R^{2}\) ) can be computed directly from the posterior samples, or included in the model specification as a parameter of the model, in a way that at each iteration of the MCMC, a value of the effect size is sampled, resulting in an estimation of its full posterior distribution. 74 A Bayesian version of the \(R^{2}\) is also available in brms using the bayes_R2 method, for which the calculations are based on Gelman, Goodrich, Gabry, & Vehtari ( 2018 ) .

In brief, we found a weak effect of gender on vowel production variability in Indonesian ( \(\beta =\) -0.042, \(\ 95 \% \ \text{CrI} =\) [-0.102, 0.018], \(\ \delta_{t} =\) -0.345, \(\ 95 \% \ \text{CrI} = [-0.807, 0.097]\) ), this effect being associated with a large uncertainty (as expressed by the width of the credible interval). This result seems to show that females tend to pronounce vowels with more variability than males, while the variation observed across vowels (as suggested by \(\sigma_{\beta_{vowel}}\) ) suggests that there might exist substantial inter-vowel variability, that should be subsequently properly studied. A follow-up analysis specifically designed to test the effect of gender on each vowel should help better describe inter-vowel variability (we give an example of such an analysis in the supplementary materials ).

To sum up, we hope that this introductive tutorial has helped the reader to understand the foundational ideas of Bayesian MLMs, and to appreciate how straightforward the interpretation of the results is. Moreover, we hope to have demonstrated that although Bayesian data analysis may still sometimes (wrongfully) sound difficult to grasp and to use, the development of recent tools like brms helps to build and fit Bayesian MLMs in an intuitive way. We believe that this shift in practice will allow more reliable statistical inferences to be drawn from empirical research.

A.6 Supplementary materials

Supplementary materials, reproducible code and figures are available at: https://osf.io/dpzcb . A lot of useful packages have been used for the writing of this paper, among which the papaja and knitr packages for writing and formatting (Aust & Barth, 2018 ; Xie, 2018 ) , the ggplot2 , viridis , ellipse , BEST , and ggridges packages for plotting (Garnier, 2018 ; Kruschke & Meredith, 2018 ; Murdoch & Chow, 2018 ; Wickham et al., 2018 ; Wilke, 2018 ) , as well as the tidyverse and broom packages for code writing and formatting (Robinson, 2018 ; Wickham, 2017 ) .

This chapter is a published paper reformatted for the need of this thesis. Source: Nalborczyk, L., Batailler, C., Lvenbruck, H., Vilain, A., & Bürkner, P.-C. (2019). An introduction to Bayesian multilevel models using brms: A case study of gender effects on vowel variability in standard Indonesian. Journal of Speech, Language, and Hearing Research, 62 (5), 1225-1242. https://doi.org/10.1044/2018_JSLHR-S-18-0006 . ↩

In this context, the maximal varying effect structure means that any potential source of systematic influence should be explicitly modelled, by adding appropriate varying effects. ↩

Note that MLMs are sometimes called mixed models , as models that comprise both fixed and random effects. ↩

Acknowledging that these individual intercepts can also be seen as adjustments to the grand intercept \(\alpha\) , that are specific to group \(j\) . ↩

Note that –for the sake of simplicity– throughout this tutorial we use a Normal likelihood, but other (better) alternatives would include using skew-normal or log-normal models, which are implemented in brms with the skew_normal and lognormal families. We provide examples in the supplementary materials . ↩

Where a credible interval is the Bayesian analogue of a classical confidence interval, except that probability statements can be made based upon it (e.g., “given the data and our prior assumptions, there is a 0.95 probability that this interval encompasses the population value \(\theta\) ”). ↩

But please note that we do not mean to suggest that the varying intercept for subjects should be removed because its ICC is low. ↩

The LKJ prior is the default prior for correlation matrices in brms . ↩

More details on model comparison using cross-validation techniques can be found in Nicenboim & Vasishth ( 2016 ) . See also Gelman, Hwang, & Vehtari ( 2014 ) for a complete comparison of information criteria. ↩

We compare the distribution with \(0\) here, but it should be noted that this comparison could be made with whatever value. ↩

See for instance Gelman & Pardoe ( 2006 ) , for measures of explained variance in MLMs and Marsman, Waldorp, Dablander, & Wagenmakers ( 2019 ) , for calculations in ANOVA designs. ↩

An Introduction to brms

Using brms for parameter estimation: A walkthrough

I am using the native pipe operator, which is new to R 4.10. This pipe operator is written as a | followed by a > . In this document, the operator is printed as |> , due to the fact that I am using font ligatures. If the pipe doesn’t work for you, simply replace it with the older pipe %>% .

In this post, I’ll show how to use brms to infer the means of two independent normally distributed samples. I’ll try to follow the steps illustrated in the previous post on a principled Bayesian workflow .

Generate data

First, we’ll generate two independent normally distributed samples. These will correspond to two levels of a grouping variable, so let’s call them group A and group B.

Group A will have a mean \(\mu_A = 20\) and a standard deviation \(\sigma_A = 2\) , whereas group B have have the parameters \(\mu_B = 16\) and \(\sigma_B = 1.5\) .

We now draw 10 observations for each group.

Since we know the true values that generated the data, we know whether we will be able to successfully recover them. Of course, the sample means and standard deviations will differ slightly from the true values.

Probabilistic model

We assume the data are conditionally normally distributed

\[ y_i \sim \mathcal{N}(\mu_{[j]}, \sigma_{[j]}) \] \[ \text{for J = 1, 2} \]

We will initially assume that the two groups have equal standard deviations (SD), so that we need only estimate one common SD parameter. We therefore need to estimate three parameters in total, \(\mu_a, \mu_b, \sigma\) (you can allow both mean and standard deviation to vary in assignment 1 ).

Linear model

Using a linear model, we have several possibilities for choosing our contrast coding. We will use treatment coding, in which we choose one of the groups as reference category. This will be represented by the intercept. The other group will not be estimated directly. Instead, the second parameter will represent the difference between this group and the reference category.

We can check the levels of the grouping variable. The first levels will be chose as the reference group.

Another possibility is to omit the intercept, and then just estimate both group means independently.

For the first approach, we use the R formula

For the second parameterization, we write

Prior distributions

We can check which for which parameters we need to set priors, and what the default priors are, using the get_prior() function.

The output doesn’t look very appealing, so we can show just the first four columns:

The three parameters are groupB , represents the difference between group B and the reference category, Intercept , which represents group A, and sigma , the common standard deviation.

Both Intercept and sigma are given Student-t priors. The first parameter of this distribution can be considered as a “normality” parameter—the higher this is, the more normal the distribution looks. The prior on the intercept has a mean of 16.9, which is based on the median of the response variable ( median(d$score) ) and a standard devation of 2.5. The default priors are guesses to ensure that the posterior is in the raight range, while making it unlikely that the prior biases the inferences.

Something that is not apparent is that the prior on sigma is actually a folded Student-t distribution—this means that the distribution is folded in half, because the parameter sigma is constrained to be positive (a standard deviation must \(>0\) .

The prior on the groupB parameter is flat. This is basically never a good idea—you should always choose your own prior, instead of using the default flat prior.

For the second parameterization, we get

Here, we get the same statndard deviation parameter, but instead of an intercept we get two parameters, one for each level of the grouping variable. Both have flat priors.

One important difference between the two is that for the second parameterization, both levels are treated in the same manner, whereas for the first approach, the reference get a prior, and the non-reference category is coded as Intercept + groupB . There the mean of group B will be estimated with more uncertainty that that of group A. While this makes sense for hypothesis testing, for estimation this is questionable. McElreath ( 2020 ) generally recommends the second approach.

We will ignore McElreath’s advice for now, and estimate mean of group B as Intercept + groupB .

Since we already know from the summary above that the difference between groups cannot tbe very large, we set a normal(0, 4) on the group difference. This expresses the belief that we are about 95% certain that the parameter will lie between \(-8\) and \(8\)

We can use the brms function prior() to do this.

The priors on the intercept and and group difference look like this:

characteristics of hypothesis in brm

Prior predictive distribution

In order to get the prior predictive distribution, we can first sample from the prior distributions using the sample_prior argument set to "only" . If we do this, we are running the same model that we will later use to obtain the posterior distribution, but we are ignoring the data.

This model will do three things: 1) provide prior distributions of the parameters, 2) provide distributions of the conditional means, i.e. the values of the linear predictor and 3) provide samples from the prior predictive distribution.

We can visualize the distribution of parameter values that our model expects using the mcmc_plots() function.

characteristics of hypothesis in brm

These distributions just reflect the prior distributions, i.e. they are sampled from each parameter’s prior distribution. It is very helpful, though, to plot the conditional means, i.e. the expected means conditioned on group membership.

characteristics of hypothesis in brm

Both groups are expected to have similar means, because that is what we expressed with our prior distribution on the group difference.

Prior predictive checks

We can then add additional variance by incorporating the residual error. This can be achieved by using the posterior_predict() function and then processing the output; however, it is often far simpler to use the built-in function pp_check() (the pp stand for posterior predictive). This function cab perform a variety of posterior predictive checks; here we are simply plotting the density of the data ( \(y\) ) along with densitites obtained from generated data ( \(y_{rep}\) ).

If we sample from the posterior, then pp_check() performs posterior predictive checks. If we sample from the prior only, then pp_check() performs prior predictive checks.

This plot can give us a good idea of what kind of data our model expects, and we can compare those to the actual data obtained

characteristics of hypothesis in brm

We can also group by our grouping variable to compare the generated data separately by group.

characteristics of hypothesis in brm

Posterior inference

If we are happy with our model, we can sample from the posterior, using the same model from above, but ommitting the sample_prior argument. As above, brms generated Stan code, which is then compiled to C++. Once the model is compiled, Stan runs 4 independent Markov chains, each of which will explore the posterior distribution. The number of chains can be specified, but it is rarely necesarry to change the default setting of 4.

It is a good idea to use as many cores as possible. Modern computers have multi-core processors. This means that Stan will make use of as many cores as it can, and run the chains in parallel. This will result in a huge speed-up. You can use the argument cores = parallel::detectCores() inside brm() to set this. It advisable to set this in the R options, so that you do have to do this every time you call brm() .

Before we look at the parameter estimates, it essential to check that the 4 chains have converged, i.e. that they are sampling from the same posterior. Calling the plot() method on the fitted object will plot traceplots (on the right of the plot), which are the estimates (on the y axis) plotted against the sample number.

characteristics of hypothesis in brm

Another way of getting these is with the function mcmc_trace() from the bayesplot package.

characteristics of hypothesis in brm

The plots for each parameter show the 4 chains (in different shades of blue). They should not be easily distinguishable from each other, and should resemble a “fat hairy caterpillar.”

Apart from visual inspection, we can check for convergence of the chains by looking at the Rhat values in the output. There is one for each estimated parameter, and these values should be approximately \(1.0\) , and \(> 1.05\) . The Rhat statistic measures the ratio of the average variance of samples within each chain to the variance of the pooled samples across chains. If these values are \(1.0\) , this means that the chains have mixed well and are sampling from the same posterior.

We can now look at the estimated parameters. Here, we get Population-Level Effects and Family Specific Parameters . The population-level effects are our intercept and group difference, the fFamily-specific parameter is the residual standard deviation. For each parameter we are shown the mean of the posterior distribution ( Estimate ), the standard deviation of the posterior distribution ( Est.Error ) as well as two-sided 95% credible intervals(l-95% CI and u-95% CI) based on quantiles. The Bulk and Tail ESS (expected sample size) are estimates of how many independent draws would contain the same amount of information as the correlated draws of the posterior (Markov chains obtain correlated draws).

Parameter estimates

The three parameters are Intercept , groupB and sigma . The latter represents the stndard deviation, which according to our model is not allowd to vary between groups (our model is thus mis-specified, as we know that sigma differs between groups.) The posterior is mean is 1.7, with a 95% CI of [1.24, 2.4]. We are thus 95% certain the the standard deviation lies within that interval.

Intercept and groupB reprsent the expected mean of the reference group, which is A in this case, and the difference between groups, respectively.

The intercept has a mean of 18.99, with a 95% CI of [17.94, 20.02], and the difference between groups has a mean of -3.24 with a 95% CI of [-4.68, 1.72].

These are merely summaries of the posterior distributions. It is also very important to look at the full posterior distributions. These can be plotted with the function mcmc_plot() .

characteristics of hypothesis in brm

We can also choose which parameters to plot:

characteristics of hypothesis in brm

Conditional means

A simple way to obtain the predicted conditional means is to use the the add_fitted_draws() from the tidybayes package.

This requires a grid of values for which we want the conditional means. In the case we use the data_grid() function from the modelr package to create this.

We can the use add_fitted_draws() to obtain the values of the linear predictor, which in this case will be either the Intercept for group A, or Intercept + groupB for group B.

These can then be plotted using the stat_pointinterval() function, which takes a .width argument to specify the width of the credible interval.

characteristics of hypothesis in brm

Posterior predictive check

Similarly to above, we can use pp_check() , which will now perform psterior predictive check (because we have sampled from the posterior).

characteristics of hypothesis in brm

It is apparent the while our model can adequately represent the shape of the data, the predictions vary quite a lot, which is due to there not being enough data (this is only a toy model, after all).

Using the functions from the tidybayes package, we can plot the conditional exptected means, the posterior predictions for the data along with the actual data, all in one plot.

characteristics of hypothesis in brm

The blue band shows the posterior predictive density for new data (what data does our model predict?), the black dots within the blue bands show the actual data points, and the intervals underneath show the expected conditional means (values of the linear predictor).

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Text and figures are licensed under Creative Commons Attribution CC BY 4.0 . Source code is available at https://github.com/awellis/learnmultilevelmodels , unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

For attribution, please cite this work as

BibTeX citation

Rens van de Schoot

R Linear Regression Bayesian (using brms)

R regression bayesian (using brms), by laurent smeets and rens van de schoot, last modified: 21 august 2019.

This tutorial provides the reader with a basic tutorial how to perform a Bayesian regression in brms , using Stan instead of as the MCMC sampler. Throughout this tutorial, the reader will be guided through importing data files, exploring summary statistics and regression analyses. Here, we will exclusively focus on Bayesian statistics.

In this tutorial, we start by using the default prior settings of the software. In a second step, we will apply user-specified priors, and if you really want to use Bayes for your own data, we recommend to follow the WAMBS-checklist , also available in other software.

We are continuously improving the tutorials so let me know if you discover mistakes, or if you have additional resources I can refer to. The source code is available via Github . If you want to be the first to be informed about updates, follow me on  Twitter .

Preparation

This tutorial expects:

  • Installation of STAN and Rtools . For more information please see https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started
  • Installation of R packages rstan , and brms . This tutorial was made using brms version 2.9.0 in R version 3.6.1
  • Basic knowledge of hypothesis testing
  • Basic knowledge of correlation and regression
  • Basic knowledge of Bayesian inference
  • Basic knowledge of coding in R

Example Data

The data we will be using for this exercise is based on a study about predicting PhD-delays ( Van de Schoot, Yerkes, Mouw and Sonneveld 2013 ).The data can be downloaded here . Among many other questions, the researchers asked the Ph.D. recipients how long it took them to finish their Ph.D. thesis (n=333). It appeared that Ph.D. recipients took an average of 59.8 months (five years and four months) to complete their Ph.D. trajectory. The variable B3_difference_extra measures the difference between planned and actual project time in months (mean=9.97, minimum=-31, maximum=91, sd=14.43). For more information on the sample, instruments, methodology and research context we refer the interested reader to the paper.

For the current exercise we are interested in the question whether age (M = 31.7, SD = 6.86) of the Ph.D. recipients is related to a delay in their project.

The relation between completion time and age is expected to be non-linear. This might be due to that at a certain point in your life (i.e., mid thirties), family life takes up more of your time than when you are in your twenties or when you are older.

So, in our model the \(gap\) ( B3_difference_extra ) is the dependent variable and \(age\) ( E22_Age ) and \(age^2\) ( E22_Age_Squared ) are the predictors. The data can be found in the file phd-delays.csv .

Question: Write down the null and alternative hypotheses that represent this question. Which hypothesis do you deem more likely?

\(H_0:\) \(age\) is not related to a delay in the PhD projects.

\(H_1:\) \(age\) is related to a delay in the PhD projects.

\(H_0:\) \(age^2\) is not related to a delay in the PhD projects.

\(H_1:\) \(age^2\) is related to a delay in the PhD projects.

Preparation – Importing and Exploring Data

You can find the data in the file phd-delays.csv , which contains all variables that you need for this analysis. Although it is a .csv-file, you can directly load it into R using the following syntax:

Alternatively, you can directly download them from GitHub into your R work space using the following command:

GitHub is a platform that allows researchers and developers to share code, software and research and to collaborate on projects (see https://github.com/ )

Once you loaded in your data, it is advisable to check whether your data import worked well. Therefore, first have a look at the summary statistics of your data. you can do this by using the describe() function.

Question: Have all your data been loaded in correctly? That is, do all data points substantively make sense? If you are unsure, go back to the .csv-file to inspect the raw data.

The descriptive statistics make sense:

diff: Mean (9.97), SE (0.791)

\(Age\) : Mean (31.68), SE (0.38)

\(Age^2\) : Mean (1050.22), SE (35.97)

Before we continue with analyzing the data we can also plot the expected relationship.

characteristics of hypothesis in brm

Regression – Default Priors

In this exercise you will investigate the impact of Ph.D. students’ \(age\) and \(age^2\) on the delay in their project time, which serves as the outcome variable using a regression analysis (note that we ignore assumption checking!).

As you know, Bayesian inference consists of combining a prior distribution with the likelihood obtained from the data. Specifying a prior distribution is one of the most crucial points in Bayesian inference and should be treated with your highest attention (for a quick refresher see e.g.  Van de Schoot et al. 2017 ). In this tutorial, we will first rely on the default prior settings, thereby behaving a ‘naive’ Bayesians (which might not always be a good idea).

To run a multiple regression with brms, you first specify the model, then fit the model and finally acquire the summary (similar to the frequentist model using lm() ). The model is specified as follows:

  • A dependent variable we want to predict.
  • A “~”, that we use to indicate that we now give the other variables of interest. (comparable to the ‘=’ of the regression equation).
  • The different independent variables separated by the summation symbol ‘+’.
  • Finally, we insert that the dependent variable has a variance and that we want an intercept.
  • We do set a seed to make the results exactly reproducible.

There are many other options we can select, such as the number of chains how many iterations we want and how long of a warm-up phase we want, but we will just use the defaults for now.

For more information on the basics of brms, see the website and vignettes

The following code is how to specify the regression model:

Now we will have a look at the summary by using summary(model) or posterior_summary(model) for more precise estimates of the coefficients

The results that stem from a Bayesian analysis are genuinely different from those that are provided by a frequentist model.

The key difference between Bayesian statistical inference and frequentist statistical methods concerns the nature of the unknown parameters that you are trying to estimate. In the frequentist framework, a parameter of interest is assumed to be unknown, but fixed. That is, it is assumed that in the population there is only one true population parameter, for example, one true mean or one true regression coefficient. In the Bayesian view of subjective probability, all unknown parameters are treated as uncertain and therefore are be described by a probability distribution. Every parameter is unknown, and everything unknown receives a distribution.

This is why in frequentist inference, you are primarily provided with a point estimate of the unknown but fixed population parameter. This is the parameter value that, given the data, is most likely in the population. An accompanying confidence interval tries to give you further insight in the uncertainty that is attached to this estimate. It is important to realize that a confidence interval simply constitutes a simulation quantity. Over an infinite number of samples taken from the population, the procedure to construct a (95%) confidence interval will let it contain the true population value 95% of the time. This does not provide you with any information how probable it is that the population parameter lies within the confidence interval boundaries that you observe in your very specific and sole sample that you are analyzing.

In Bayesian analyses, the key to your inference is the parameter of interest’s posterior distribution. It fulfils every property of a probability distribution and quantifies how probable it is for the population parameter to lie in certain regions. On the one hand, you can characterize the posterior by its mode. This is the parameter value that, given the data and its prior probability, is most probable in the population. Alternatively, you can use the posterior’s mean or median. Using the same distribution, you can construct a 95% credibility interval, the counterpart to the confidence interval in frequentist statistics. Other than the confidence interval, the Bayesian counterpart directly quantifies the probability that the population value lies within certain limits. There is a 95% probability that the parameter value of interest lies within the boundaries of the 95% credibility interval. Unlike the confidence interval, this is not merely a simulation quantity, but a concise and intuitive probability statement. For more on how to interpret Bayesian analysis, check Van de Schoot et al. 2014 .

Question: Interpret the estimated effect, its interval and the posterior distribution.

\(Age\) seems to be a relevant predictor of PhD delays, with a posterior mean regression coefficient of 2.67, 95% Credibility Interval [1.53, 3.83]. Also, \(age^2\) seems to be a relevant predictor of PhD delays, with a posterior mean of -0.0259, and a 95% credibility Interval of [-0.038, -0.014]. The 95% Credibility Interval shows that there is a 95% probability that these regression coefficients in the population lie within the corresponding intervals, see also the posterior distributions in the figures below. Since 0 is not contained in the Credibility Interval we can be fairly sure there is an effect.

Question: Every Bayesian model uses a prior distribution. Describe the shape of the prior distributions of the regression coefficients.

To check which default priors are being used by brms, you can use the prior_summary() function or check the brms documentation , which states that, “The default prior for population-level effects (including monotonic and category specific effects) is an improper flat prior over the reals” This means, that there an uninformative prior was chosen.

We also see that a student-t distribution was chosen for the intercept.

Regression – User-specified Priors

In brms, you can also manually specify your prior distributions. Be aware that usually, this has to be done BEFORE peeking at the data, otherwise you are double-dipping (!). In theory, you can specify your prior knowledge using any kind of distribution you like. However, if your prior distribution does not follow the same parametric form as your likelihood, calculating the model can be computationally intense. Conjugate priors avoid this issue, as they take on a functional form that is suitable for the model that you are constructing. For your normal linear regression model, conjugacy is reached if the priors for your regression parameters are specified using normal distributions (the residual variance receives an inverse gamma distribution, which is neglected here). In brms, you are quite flexible in the specification of informative priors.

Let’s re-specify the regression model of the exercise above, using conjugate priors. We leave the priors for the intercept and the residual variance untouched for the moment. Regarding your regression parameters, you need to specify the hyperparameters of their normal distribution, which are the mean and the variance. The mean indicates which parameter value you deem most likely. The variance expresses how certain you are about that. We try 4 different prior specifications, for both the \(\beta_{age}\) regression coefficient, and the \(\beta_{age^2}\) coefficient.

First, we use the following prior specifications:

\(Age\) ~ \(\mathcal{N}(3, 0.4)\)

\(Age^2\) ~ \(\mathcal{N}(0, 0.1)\)

In brms, the priors are set using the set_prior() function. Be careful, Stan uses standard deviations instead of variance in the normal distribution. The standard deviations is the square root of the variance, so a variance of 0.1 corresponds to a standard deviation of 0.316 and a variance of 0.4 corresponds to a standard deviation of 0.632.

The priors are presented in code as follows:

Now we can run the model again, but with the prior= included. Now fit the model again and request for summary statistics.

Question Fill in the results in the table below:

Next, try to adapt the code, using the prior specifications of the other columns and then complete the table.

Question : Compare the results over the different prior specifications. Are the results comparable with the default model?

Question : do we end up with similar conclusions, using different prior specifications.

To answer these questions, proceed as follows: We can calculate the relative bias to express this difference. ( \(bias= 100*\frac{(model \; informative\; priors\;-\;model \; uninformative\; priors)}{model \;uninformative \;priors}\) ). In order to preserve clarity we will just calculate the bias of the two regression coefficients and only compare the default (uninformative) model with the model that uses the \(\mathcal{N}(20, .4)\) and \(\mathcal{N}(20, .1)\) priors. Copy Paste the following code to R:

The b_age and b_age2 indices stand for the \(\beta_{age}\) and \(\beta_{age^2}\) respectively.

We can also plot these differences by plotting both the posterior and priors for the five different models we ran. In this example we only plot the regression of coefficient of age \(\beta_{age}\)

First we extract the MCMC chains of the 5 different models for only this one parameter ( \(\beta_{age}\) =beta[1,2,1]). Copy-past the following code to R:

instead of sampling the priors like this, you could also get the actual prior values sampled by Stan by adding the sample_prior = TRUE command to the brm() function, this would save the priors as used by stan. With enough samples this would yield the same results

Then, we can plot the different posteriors and priors by using the following code:

characteristics of hypothesis in brm

Now, with the information from the table, the bias estimates and the plot you can answer the two questions about the influence of the priors on the results.

We see that the influence of this highly informative prior is around 386% and 406% on the two regression coefficients respectively. This is a large difference and we thus certainly would not end up with similar conclusions.

The results change with different prior specifications, but are still comparable. Only using \(\mathcal{N}(20, .4)\) for age, results in a really different coefficients, since this prior mean is far from the mean of the data, while its variance is quite certain. However, in general the other results are comparable. Because we use a big dataset the influence of the prior is relatively small. If one would use a smaller dataset the influence of the priors are larger. To check this you can use these lines to sample roughly 20% of all cases and redo the same analysis. The results will of course be different because we use many fewer cases (probably too few!). Use this code.

We made a new dataset with randomly chosen 60 of the 333 observations from the original dataset. You can repeat the analyses with the same code and only changing the name of the dataset to see the influence of priors on a smaller dataset. Run the model model.informative.priors2 with this new dataset

If you really want to use Bayes for your own data, we recommend to follow the WAMBS-checklist , which you are guided through by this exercise.

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.,… Johnson, V. (2017, July 22). Redefine statistical significance . Retrieved from psyarxiv.com/mky9j

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N. Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations . European Journal of Epidemiology 31 (4 ). https://doi.org/10.1007/s10654-016-0149-3

Hoffman, M. D., & Gelman, A. (2014) . The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593-1623.

van de Schoot R, Yerkes MA, Mouw JM, Sonneveld H (2013) What Took Them So Long? Explaining PhD Delays among Doctoral Candidates . PLoS ONE 8(7): e68839. https://doi.org/10.1371/journal.pone.0068839

Trafimow D, Amrhein V, Areshenkoff CN, Barrera-Causil C, Beh EJ, Bilgi? Y, Bono R, Bradley MT, Briggs WM, Cepeda-Freyre HA, Chaigneau SE, Ciocca DR, Carlos Correa J, Cousineau D, de Boer MR, Dhar SS, Dolgov I, G?mez-Benito J, Grendar M, Grice J, Guerrero-Gimenez ME, Guti?rrez A, Huedo-Medina TB, Jaffe K, Janyan A, Karimnezhad A, Korner-Nievergelt F, Kosugi K, Lachmair M, Ledesma R, Limongi R, Liuzza MT, Lombardo R, Marks M, Meinlschmidt G, Nalborczyk L, Nguyen HT, Ospina R, Perezgonzalez JD, Pfister R, Rahona JJ, Rodr?guez-Medina DA, Rom?o X, Ruiz-Fern?ndez S, Suarez I, Tegethoff M, Tejo M, ** van de Schoot R** , Vankov I, Velasco-Forero S, Wang T, Yamada Y, Zoppino FC, Marmolejo-Ramos F. (2017) Manipulating the alpha level cannot cure significance testing – comments on “Redefine statistical significance” PeerJ reprints 5:e3411v1 https://doi.org/10.7287/peerj.preprints.3411v1

Privacy Overview

Estimating Distributional Models with brms

Paul bürkner, introduction.

This vignette provides an introduction on how to fit distributional regression models with brms . We use the term distributional model to refer to a model, in which we can specify predictor terms for all parameters of the assumed response distribution. In the vast majority of regression model implementations, only the location parameter (usually the mean) of the response distribution depends on the predictors and corresponding regression parameters. Other parameters (e.g., scale or shape parameters) are estimated as auxiliary parameters assuming them to be constant across observations. This assumption is so common that most researchers applying regression models are often (in my experience) not aware of the possibility of relaxing it. This is understandable insofar as relaxing this assumption drastically increase model complexity and thus makes models hard to fit. Fortunately, brms uses Stan on the backend, which is an incredibly flexible and powerful tool for estimating Bayesian models so that model complexity is much less of an issue.

Suppose we have a normally distributed response variable. Then, in basic linear regression, we specify a predictor term \(\eta_{\mu}\) for the mean parameter \(\mu\) of the normal distribution. The second parameter of the normal distribution – the residual standard deviation \(\sigma\) – is assumed to be constant across observations. We estimate \(\sigma\) but do not try to predict it. In a distributional model, however, we do exactly this by specifying a predictor term \(\eta_{\sigma}\) for \(\sigma\) in addition to the predictor term \(\eta_{\mu}\) . Ignoring group-level effects for the moment, the linear predictor of a parameter \(\theta\) for observation \(n\) has the form

\[\eta_{\theta n} = \sum_{i = 1}^{K_{\theta}} b_{\theta i} x_{\theta i n}\] where \(x_{\theta i n}\) denotes the value of the \(i\) th predictor of parameter \(\theta\) for observation \(n\) and \(b_{\theta i}\) is the \(i\) th regression coefficient of parameter \(\theta\) . A distributional normal model with response variable \(y\) can then be written as

\[y_n \sim \mathcal{N}\left(\eta_{\mu n}, \, \exp(\eta_{\sigma n}) \right)\] We used the exponential function around \(\eta_{\sigma}\) to reflect that \(\sigma\) constitutes a standard deviation and thus only takes on positive values, while a linear predictor can be any real number.

A simple distributional model

Unequal variance models are possibly the most simple, but nevertheless very important application of distributional models. Suppose we have two groups of patients: One group receives a treatment (e.g., an antidepressive drug) and another group receives placebo. Since the treatment may not work equally well for all patients, the symptom variance of the treatment group may be larger than the symptom variance of the placebo group after some weeks of treatment. For simplicity, assume that we only investigate the post-treatment values.

The following model estimates the effect of group on both the mean and the residual standard deviation of the normal response distribution.

Useful summary statistics and plots can be obtained via

characteristics of hypothesis in brm

The population-level effect sigma_grouptreat , which is the contrast of the two residual standard deviations on the log-scale, reveals that the variances of both groups are indeed different. This impression is confirmed when looking at the conditional_effects of group . Going one step further, we can compute the residual standard deviations on the original scale using the hypothesis method.

We may also directly compare them and plot the posterior distribution of their difference.

characteristics of hypothesis in brm

Indeed, the residual standard deviation of the treatment group seems to larger than that of the placebo group. Moreover the magnitude of this difference is pretty similar to what we expected due to the values we put into the data simulations.

Zero-Inflated Models

Another important application of the distributional regression framework are so called zero-inflated models. These models are helpful whenever there are more zeros in the response variable than one would naturally expect. For example, if one seeks to predict the number of cigarettes people smoke per day and also includes non-smokers, there will be a huge amount of zeros which, when not modeled appropriately, can seriously distort parameter estimates. Here, we consider an example dealing with the number of fish caught by various groups of people. On the UCLA website (), the data are described as follows: “The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.”

As predictors we choose the number of people per group, the number of children, as well as whether the group consists of campers. Many groups may not even try catching any fish at all (thus leading to many zero responses) and so we fit a zero-inflated Poisson model to the data. For now, we assume a constant zero-inflation probability across observations.

Again, we summarize the results using the usual methods.

characteristics of hypothesis in brm

According to the parameter estimates, larger groups catch more fish, campers catch more fish than non-campers, and groups with more children catch less fish. The zero-inflation probability zi is pretty large with a mean of 41%. Please note that the probability of catching no fish is actually higher than 41%, but parts of this probability are already modeled by the Poisson distribution itself (hence the name zero- inflation ). If you want to treat all zeros as originating from a separate process, you can use hurdle models instead (not shown here).

Now, we try to additionally predict the zero-inflation probability by the number of children. The underlying reasoning is that we expect groups with more children to not even try catching fish. Most children are just terribly bad at waiting for hours until something happens. From a purely statistical perspective, zero-inflated (and hurdle) distributions are a mixture of two processes and predicting both parts of the model is natural and often very reasonable to make full use of the data.

characteristics of hypothesis in brm

To transform the linear predictor of zi into a probability, brms applies the logit-link:

\[logit(zi) = \log\left(\frac{zi}{1-zi}\right) = \eta_{zi}\]

The logit-link takes values within \([0, 1]\) and returns values on the real line. Thus, it allows the transition between probabilities and linear predictors.

According to the model, trying to fish with children not only decreases the overall number fish caught (as implied by the Poisson part of the model) but also drastically increases your change of catching no fish at all (as implied by the zero-inflation part) most likely because groups with more children are not even trying.

Additive Distributional Models

In the examples so far, we did not have multilevel data and thus did not fully use the capabilities of the distributional regression framework of brms . In the example presented below, we will not only show how to deal with multilevel data in distributional models, but also how to incorporate smooth terms (i.e., splines) into the model. In many applications, we have no or only a very vague idea how the relationship between a predictor and the response looks like. A very flexible approach to tackle this problems is to use splines and let them figure out the form of the relationship. For illustration purposes, we simulate some data with the mgcv package, which is also used in brms to prepare smooth terms.

The data contains the predictors x0 to x3 as well as the grouping factor fac indicating the nested structure of the data. We predict the response variable y using smooth terms of x1 and x2 and a varying intercept of fac . In addition, we assume the residual standard deviation sigma to vary by a smoothing term of x0 and a varying intercept of fac .

characteristics of hypothesis in brm

This model is likely an overkill for the data at hand, but nicely demonstrates the ease with which one can specify complex models with brms and to fit them using Stan on the backend.

Bayesian Meta-Analysis with R, Stan, and brms

Tilburg University

Introduction

Recently, there’s been a lot of talk about meta-analysis, and here I would just like to quickly show that Bayesian multilevel modeling nicely takes care of your meta-analysis needs, and that it is easy to do in R with the rstan and brms packages. As you’ll see, meta-analysis is a special case of Bayesian multilevel modeling when you are unable or unwilling to put a prior distribution on the meta-analytic effect size estimate.

The idea for this post came from Wolfgang Viechtbauer’s website , where he compared results for meta-analytic models fitted with his great (frequentist) package metafor and the swiss army knife of multilevel modeling, lme4 . It turns out that even though you can fit meta-analytic models with lme4, the results are slightly different from traditional meta-analytic models, because the experiment-wise variances are treated slightly differently.

Here are the packages we’ll use:

Here I’ll only focus on a simple random effects meta-analysis of effect sizes, and will use the same example data as in the aforementioned website . The data are included in the metafor package, and describe the relationship between conscientiousness and medication adherence. The effect sizes are r to z transformed correlations.

We are going to fit a random-effects meta-analysis model to these observed effect sizes and their standard errors. Here’s what this model looks like, loosely following notation from the R package Metafor’s manual (p.6):

\[y_i \sim N(\theta_i, \sigma_i^2)\]

where each recorded effect size, \(y_i\) is a draw from a normal distribution which is centered on that study’s “true” effect size \(\theta_i\) and has standard deviation equal to the study’s observed standard error \(\sigma_i\) .

Our next set of assumptions is that the studies’ true effect sizes approximate some underlying effect size in the (hypothetical) population of all studies. We call this underlying population effect size \(\mu\) , and its standard deviation \(\tau\) , such that the true effect sizes are thus distributed:

\[\theta_i \sim N(\mu, \tau^2)\]

We now have two interesting parameters: \(\mu\) tells us, all else being equal, what I may expect the “true” effect to be, in the population of similar studies. \(\tau\) tells us how much individual studies of this effect vary.

I think it is most straightforward to write this model as yet another mixed-effects model (metafor manual p.6):

\[y_i \sim N(\mu + \theta_i, \sigma^2_i)\]

where \(\theta_i \sim N(0, \tau^2)\) , studies’ true effects are normally distributed with between-study heterogeneity \(\tau^2\) . The reason this is a little confusing (to me at least), is that we know the \(\sigma_i\) s (this being the fact that separates meta-analysis from other more common regression modeling).

Estimation with metafor

Super easy!

Bayesian estimation

So far so good, we’re strictly in the realm of standard meta-analysis. But I would like to propose that instead of using custom meta-analysis software, we simply consider the above model as just another regression model, and fit it like we would any other (multilevel) regression model. That is, using Stan , usually through the brms interface. Going Bayesian allows us to assign prior distributions on the population-level parameters \(\mu\) and \(\tau\) , and we would usually want to use some very mildly regularizing priors. Here we proceed with brms’ default priors (which I print below with the output)

Estimation with brms

Here’s how to fit this model with brms:

These results are the same as the ones obtained with metafor. Note the Student’s t prior distributions, which are diffuse enough not to exert influence on the posterior distribution.

Comparing results

We can now compare the results of these two estimation methods. Of course, the Bayesian method has a tremendous advantage, because it results in a full distribution of plausible values.

characteristics of hypothesis in brm

We can see from the numeric output, and especially the figures, that these modes of inference yield the same numerical results. Keep in mind though, that the Bayesian estimates actually allow you to discuss probabilities, and generally the things that we’d like to discuss when talking about results.

For example, what is the probability that the average effect size is greater than 0.2? About eight percent:

Forest plot

The forest plot displays the entire posterior distribution of each \(\theta_i\) . The meta-analytic effect size \(\mu\) is also displayed in the bottom row. I’ll show a considerable amount of code here so that you can create your own forest plots from brms output:

characteristics of hypothesis in brm

Focus on Moran et al. (1997)’s observed effect size (the empty circle): This is an anomalous result compared to all other studies. One might describe it as incredible , and that is indeed what the bayesian estimation procedure has done, and the resulting posterior distribution is no longer equivalent to the observed effect size. Instead, it is shrunken toward the average effect size. Now look at the table above, this study only had 56 participants, so we should be more skeptical of this study’s observed ES, and perhaps we should then adjust our beliefs about this study in the context of other studies. Therefore, our best guess about this study’s effect size, given all the other studies is no longer the observed mean, but something closer to the average across the studies.

If this shrinkage business seems radical, consider Quine et al. (2012). This study had a much greater sample size (537), and therefore a smaller SE. It was also generally more in line with the average effect size estimate. Therefore, the observed mean ES and the mean of the posterior distribution are pretty much identical. This is also a fairly desirable feature.

The way these different methods are presented (regression, meta-analysis, ANOVA, …), it is quite easy for a beginner, like me, to lose sight of the forest for the trees. I also feel that this is a general experience for students of applied statistics: Every experiment, situation, and question results in a different statistical method (or worse: “Which test should I use?”), and the student doesn’t see how the methods relate to each other. So I think focusing on the (regression) model is key, but often overlooked in favor of this sort of decision tree model of choosing statistical methods ( McElreath 2020 ) .

Accordingly, I think we’ve ended up in a situation where meta-analysis, for example, is seen as somehow separate from all the other modeling we do, such as repeated measures t-tests. In fact I think applied statistics in Psychology may too often appear as an unconnected bunch of tricks and models, leading to confusion and inefficient implementation of appropriate methods.

Bayesian multilevel modeling

As I’ve been learning more about statistics, I’ve often noticed that some technique, applied in a specific set of situations, turns out to be a special case of a more general modeling approach. I’ll call this approach here Bayesian multilevel modeling ( McElreath 2020 ) . If you are forced to choose one statistical method to learn, it should be Bayesian multilevel modeling, because it allows you to do and understand most things, and allows you to see how similar all these methods are, under the hood.

Bayesian Modeling Using Stan

Chapter 2 binomial modeling, 2.1 packages for example, 2.2 example.

Suppose a sample of \(n = 20\) college students are asked if they plan on wearing masks while attending class. Let \(p\) denote the proportion of all students who plan on wearing masks.

2.3 Prior on proportion

Suppose you believe that \(p = 0.40\) and you are 90 percent sure that \(p < 0.60\) .

Use beta.select() from the ProbBayes package to find the shape parameters of the matching beta curve prior.

A beta(4.31, 6.30) prior represents one’s beliefs about the proportion \(p\) .

2.4 Prior on the logit parameter

Since we will writing a model in terms of the logit function \[ \theta = \log \left(\frac{p}{1-p}\right) \]

We want to find a corresponding normal prior on \(\theta\) .

A simple way of doing this is by simulation …

  • Simulate 1000 draws from the beta prior on \(p\) .
  • Compute \(\theta\) on these simulated draws of \(p\) .
  • Find the sample mean and standard deviation of these draws – those will be estimates of the mean and standard deviation of the normal prior on \(\theta\) .

The corresponding prior on the logit parameter \(\theta\) is assumed to be normal with mean \(-0.400\) and standard deviation \(0.654\) .

2.5 Fitting the model

The model is \(y_1, ..., y_{20}\) are a random sample from a Bernoulli distribution with probability \(p\) where \(p\) has the logistic representation.

\[ \log \left(\frac{p}{1-p}\right) = \theta \]

where \(\theta \sim N(-0.400, 0.654)\) .

We put the twenty binary responses in a data frame.

We use the brm() function from the brms package to fit the model.

The plot() function will display a density plot and a trace plot of the intercept \(\theta\) .

characteristics of hypothesis in brm

The summary() function provides summary statistics for \(\theta\) .

The posterior_samples() function will display the simulated draws of \(\theta\) .

2.6 Inferences about the proportion

To obtain a sample of draws from the posterior distribution on \(p\) , one can use the inverse logit transformation on the simulated draws of \(\theta\) .

\[ p = \frac{\exp(\theta)}{1 + \exp(\theta)} \]

The posterior density for \(p\) is found by constructing a density plot of the simulated draws of \(p\) .

characteristics of hypothesis in brm

A 90% posterior interval estimate is found by selecting particular quantiles from the simulated values of \(p\) .

brms Bayesian Regression Models using 'Stan'

  • Package overview
  • Define Custom Response Distributions with brms
  • Estimating Distributional Models with brms
  • Estimating Monotonic Effects with brms
  • Estimating Multivariate Models with brms
  • Estimating Non-Linear Models with brms
  • Estimating Phylogenetic Multilevel Models with brms
  • Handle Missing Values with brms
  • Parameterization of Response Distributions in brms
  • Running brms models with within-chain parallelization
  • add_criterion: Add model fit criteria to model objects
  • add_ic: Add model fit criteria to model objects
  • addition-terms: Additional Response Information
  • add_rstan_model: Add compiled 'rstan' models to 'brmsfit' objects
  • ar: Set up AR(p) correlation structures
  • arma: Set up ARMA(p,q) correlation structures
  • as.brmsprior: Transform into a brmsprior object
  • as.data.frame.brmsfit: Extract Posterior Draws
  • as.mcmc.brmsfit: Extract posterior samples for use with the 'coda' package
  • AsymLaplace: The Asymmetric Laplace Distribution
  • autocor.brmsfit: (Deprecated) Extract Autocorrelation Objects
  • autocor-terms: Autocorrelation structures
  • bayes_factor.brmsfit: Bayes Factors from Marginal Likelihoods
  • bayes_R2.brmsfit: Compute a Bayesian version of R-squared for regression models
  • BetaBinomial: The Beta-binomial Distribution
  • bridge_sampler.brmsfit: Log Marginal Likelihood via Bridge Sampling
  • brm: Fit Bayesian Generalized (Non-)Linear Multivariate Multilevel...
  • brm_multiple: Run the same 'brms' model on multiple datasets
  • brmsfamily: Special Family Functions for 'brms' Models
  • brmsfit-class: Class 'brmsfit' of models fitted with the 'brms' package
  • brmsfit_needs_refit: Check if cached fit can be used.
  • brmsformula: Set up a model formula for use in 'brms'
  • brmsformula-helpers: Linear and Non-linear formulas in 'brms'
  • brmshypothesis: Descriptions of 'brmshypothesis' Objects
  • brms-package: Bayesian Regression Models using 'Stan'
  • brmsterms: Parse Formulas of 'brms' Models
  • car: Spatial conditional autoregressive (CAR) structures
  • coef.brmsfit: Extract Model Coefficients
  • combine_models: Combine Models fitted with 'brms'
  • compare_ic: Compare Information Criteria of Different Models
  • conditional_effects.brmsfit: Display Conditional Effects of Predictors
  • conditional_smooths.brmsfit: Display Smooth Terms
  • control_params: Extract Control Parameters of the NUTS Sampler
  • cor_ar: (Deprecated) AR(p) correlation structure
  • cor_arma: (Deprecated) ARMA(p,q) correlation structure
  • cor_arr: (Defunct) ARR correlation structure
  • cor_brms: (Deprecated) Correlation structure classes for the 'brms'...
  • cor_bsts: (Defunct) Basic Bayesian Structural Time Series
  • cor_car: (Deprecated) Spatial conditional autoregressive (CAR)...
  • cor_cosy: (Deprecated) Compound Symmetry (COSY) Correlation Structure
  • cor_fixed: (Deprecated) Fixed user-defined covariance matrices
  • cor_ma: (Deprecated) MA(q) correlation structure
  • cor_sar: (Deprecated) Spatial simultaneous autoregressive (SAR)...
  • cosy: Set up COSY correlation structures
  • cs: Category Specific Predictors in 'brms' Models
  • custom_family: Custom Families in 'brms' Models
  • data_predictor: Prepare Predictor Data
  • data_response: Prepare Response Data
  • density_ratio: Compute Density Ratios
  • diagnostic-quantities: Extract Diagnostic Quantities of 'brms' Models
  • Browse all...

R/hypothesis.R In brms: Bayesian Regression Models using 'Stan'

Defines functions plot.brmshypothesis print.brmshypothesis round_numeric evidence_ratio density_ratio find_vars eval_hypothesis combine_hlist hypothesis_coef .hypothesis hypothesis.default hypothesis hypothesis.brmsfit, documented in density_ratio hypothesis hypothesis.brmsfit hypothesis.default plot.brmshypothesis print.brmshypothesis, try the brms package in your browser.

Any scripts or data that you put into this service are public.

R Package Documentation

Browse r packages, we want your feedback.

characteristics of hypothesis in brm

Add the following code to your website.

REMOVE THIS Copy to clipboard

For more information on customizing the embed code, read Embedding Snippets .

How to properly compare interacting levels

I am new to brms , trying to figure out its behaviour in details. I was trying to run essentially an anova-like model with random effects (anova-like in the sense that all IVs are factors). For example,

conditional_effects() and hypothesis() are truly amazing analytic tools. However, when you deal with factors and, in particular, their interactions, that’s a bugger… It’s really hard to help students wrap their head around it. In addition to that, I am a bit stubborn and I really despise when people, for practical purposes, re-level their factors to figure out what contrasts are significant. There is no need for that. In frequentist framework, one could make use of Wald’s test etc. Here, I sense, the two functions can be of much use, but still one need to be careful around the intercept. All other levels are relative change to that reference.

If one have a simple dummy variable, 0 + Intercept + Dummy + ... is of help (another nice one!). But when you have factors with more levels and their interactions, things are far from straightforward.

What would be the brms way of making this less painful? Running another model without intercept and then use hypothesis() ? Something else?

  • Operating System: macOS
  • brms Version: 2.14.4

Hey @striatum , it’s difficult to follow your question. What exactly are you trying to make less painful? It might help if you provide a minimal reproducible example.

Apologies for the lack of clarity. Let me try again, this time with an example. (I am simplifying here, ignoring the possibility to have random effects, some covariates etc.)

This gives the summary table (it could be different in another run, given random number generators):

Now, what causes pains to grasp and work through are various comparisons when levels and combination in the table are relative to the reference: F1==A; F2==I; F3==M .

I’ve found it almost funny (and creative too) when some students approached to this by running comparable models with Intercept removed. Given the example above, something like:

And then they apply hypothesis() . For example:

This is rather clumsy and inelegant, if not altogether wrong. Yet, I see the struggle and I’m running out of ideas on how to help… In addition to this, it appears that what hypothesis() tells you “contradicts” what you observe in the plot from conditional_effects() . In fact, is there a way to get the BF directly from the conditional_effects() table?

This is clearer. Thank you.

A few things:

It is possible to have multiple factor variables where each level of each factor is expressed in its own terms. One way is with the non-linear syntax. I have an example, here , model b5.11 . For more on the non-linear syntax, see Paul’s vignette .

Another way is with the multilevel ANOVA approach. I have examples here and possibly more to your interest here with model fit19.1 and here with fit24.1 .

As to contrasts, I typically do these by computing the relevant difference scores by working directly with the output from posterior_samples() or possibly with fitted() . However, if you use the multilevel ANOVA approach, you can use the compare_levels() function from the tidybayes package (see here ).

As far as the hypothesis() approach and Bayes factors, I’m personally not a fan of those methods. Someone else will have to offer guidance with those.

Thanks so much for this! I am trying to make the model work, but I am getting an error message:

Error: Factors with more than two levels are not allowed as covariates. So, following my example above, this is my actual model:

So, it interaction the problem here? Or?

Consider a simplified version of your model from above. Here we’re modeling the criterion y with two factor variables, F1 and F2 , and their interaction.

I concede there might be more elegant ways to express this. But this works.

Thanks so much! May I ask, please, to give me a template that would also contain some random effects?

I haven’t fit a model with that specification, before.

@Solomon , thanks so much! You’ve been super-patient and helpful. To show my gratitude, and give back a bit from what I’ve learned.

This is one of the more (if not the most) elaborate models we’ve been working on recently. And, now, it works under the above specification/approach:

Apart from four factors (fixed effects), there is one covariate that is the order of trial in the experiment. Since it is scaled (z-transformed) then Gaussian prior with s=1.0 is appropriate.

Thanks again!

Related Topics

Bayesian Multilevel Modeling with brms

Created by: Paul A. Bloom extra R

Links to Files

The files for all tutorials can be downloaded from the Columbia Psychology Scientific Computing GitHub page using these instructions . This particular file is located here: /content/tutorials/r-extra/brms/multilevel-models-with-brms.rmd .

0) Goals for this vignette

  • Give a quick intro to multilevel modeling and Bayesian inference
  • Show a use case with Brms and some helpful syntax for demonstrating what this model does
  • Hopefully, convince you to use multilevel modeling and Bayesian approaches for your stats

1) Why multilevel modeling?

Multilevel modeling, also called ‘hierarchical’, or ‘mixed-effects’ modeling is an extrordinarly powerfull tool when we have data with a nested structure!

A few tutorials on multilevel modeling:

  • An awesome visual introduction to multilevel models
  • Tristan Mahr’s Partial Pooling Tutorial Using lme4
  • Our tutorial on plotting multilevel models in ggplot2

Is multilevel modeling hard? Why is it worth my time?

There are some challenges, but in R, constructing multilevel models follows a very similar syntax to non-nested models! While making modeling decisions can be difficult, actually running multilevel models in R is pretty intuitive!

In situations with nested data, I think multilevel modeling is almost always worth it because * these models are powerful in accounting for variance resulting from nested structures in the data * these models in many situations can help us reduce our uncertainty about population-level effects and robustly account for noisy data * these models in many cases best match our intuitions about the true structure of the data

What are some examples of situations in which multilevel modeling might be helpful?

I always like to think about whether the data are nested when addressing this question. If the data have a nested, or hierarchical structure, then multilevel modeling is probably a good idea.

Some examples of data nesting:

  • In a cognitive experiment, subjects each complete many trials of a task. Here trials are nested within subjects
  • In a math intervention program, students in different classes each take a test. Here students are nested within classes
  • We have county-level data on heart disease rates from 2017. Here we could nest counties within states

Why not just compute participant-level summaries, and do stats on these? Aren’t these summaries more stable AND easier to work with?

  • While this can be a good initial approach, we lose information about the data when we compute a summary of any kind
  • While participant-level summaries, such as the mean score across 20 trials within a participant (in the first example above), are indeed more stable than individual trials themselves, this approach assumes that all participants are equally variable and thus should all contribute equally to the results
  • Especially when some participants have more trials than others, we might want to take into account that some participants’ individual estimates are more reliable than others
  • Taking the summary approach can cause us to misinterpret our certainty about effects AND miss real trends in the data.

I’ve heard about partial pooling in multilevel models, what is this? Why would I want to change my estimates of each participant from their TRUE data?

  • Partial pooling is a process by which population-level and individual-level effects are estimated at the same time, and each participants’ estimated effect reflects a weighted combination of their own data and the population average. Becuase this structure considers subjects to be drawn from a population distribution, subjects further away from the population average are ‘pooled’ in.
  • While changing estimates of each participant from the raw data seems scary, if we consider that data usually contain noise, then this process can be viewed as adjusting noisy estimates on the subject level based on what we know about group level trends. If subjects have very precise data, and lots of it, their estimates will be adjusted less. If subjects have less data or more uncertain data, they will be pooled more.
  • Pooling can also help models become much more robust to outliers, and is especially helpful when different participants have different amounts of data. We’ll see a demo of this soon.

2) Multilevel regression model syntax!

Here is the general syntax for modeling in two popular packages, lme4 and brms . In general, this syntax looks very similar to the lm() syntax in R.

In multilevel regression models, we can let different groups (lets say subjects here) have their own intercepts or slopes or both.

Varying Intercepts:

The syntax for letting each subject have their own intercept would be:

outome ~ predictor + (1|subject)

This is basically the same syntax as an OLS regression such as lm(outcome ~ predictor) , except we add the second term to let each subject have their own intercept for this given outcome.

Varying Slopes:

The syntax for letting each subject have their own intercept AND slope would be:

outome ~ predictor + (predictor|subject)

This means that when we are estimating the effect of the predictor on the outcome, we also allow each subject to have their own (simultaneously estimated) effect for the predictor on the outcome, as well as the intercept.

3) Setting up the data

We’ll use a simulated dataset for this vignette, so you don’t need to worry about any dependencies involving datasets you don’t have access to while you’re following along. This dataset was created by Monica Thieu initially in the Tidyguide tutorials, so we’ll use a very similar one here. If you have tidyverse loaded, all this code should run if you try to run it in your R console.

Simulate Data

Raw reaction times for each subject.

characteristics of hypothesis in brm

Fancy tidveryse stuff to make a model for each participant.

See Monica’s extreme tidy bible .

Make a linear model for each participant to look at the effect of is_old , or having seen the stimulus before

Plot the estimated effect of is old for each subject

characteristics of hypothesis in brm

4) Individual Summary (no pooling) and multilevel model comparison!

A model on the summary data with no pooling, okay, now let’s construct the multilevel model, compare multilvel model subject-level estimates for the effect of is_old to those from the no-pooling model.

characteristics of hypothesis in brm

Multilevel model predictions for the age effect

characteristics of hypothesis in brm

Check out the model predictions from the no pool model

characteristics of hypothesis in brm

So, we can see from this example that the use of the multilevel model helps us recover this true effect of age whereas a model based on subject-level summaries is not as robust to outliers.

Is this just because of outliers?

This is a particularly clean example where we KNOW that one subject is an outlier who was not doing the experiment the same way. Usually we don’t have that luxury, so multilevel modeling gives us a big advantage in being able to include all the data and allowing the model to estimate effects in a manner robust to potential outliers.

This is more of a philosophical point, but in general it can be very hard (and sometimes a confound) to decide when to exclude subjects or trials. We can define cutoffs, but we may still get outliers within these cutoffs. Even with reasonable outlier exclusion rates, we often see that multilevel models help to model the data in a manner more robust to outliers.

5) What about the Bayesian stuff?

The Bayesian aspect of multilevel modeling was implicit in much of this tutorial today, but we didn’t really focus on it. However, Bayesian software like brms lends itself particuarly well to multilevel modeling frameworks where there are many parameters and optimization is complex – bayesian approaches will still give us models that don’t fit well if the data are too sparse (and we don’t set regularizing priors) but they won’t fail to converge in the same way maximum likelihood models do.

We haven’t talked much about the difference in approach when using Bayesian as opposed to frequentist modeling today – more on that soon!

Priors and predictivs in brms

Bayesian regression: theory & practice

Michael Franke

This tutorial covers how to inspect, set and sample priors in Bayesian regression models with brms . We also look at how to sample from the prior and posterior distribution. The main conceptual take-home message of this tutorial is: The choice of prior should be informed by their effect on the prior (and possibly also the posterior) predictive distribution. This emphasises the role of model criticism (prior and posterior) which a later tutorial will enlarge on.

Here is code to load (and if necessary, install) required packages, and to set some global options (for plotting and efficient fitting of Bayesian models).

Inspect & set priors

We work with the mouse-tracking data from the aida package. As a running example, we look at the linear relation between (aggregates of) area-under-the-curve AUC and MAD . Here is the relevant data plot:

characteristics of hypothesis in brm

Let’s fit a simple Bayesian regression model:

We can inspect the priors used in in a model fit like so:

The table gives information about the kind of prior used for different parameters. Parameters are classified into different classes (column “class”). The “b” class contains the slope coeffiecients. Here, we only have one slope (for MAD), which is identified in the “coef” column. For more complex models, the other colums may be important (e.g., identifying groups in multi-level models, for parameters in distributional and non-linear models, as well as lower and upper bounded paramters).

This particular output tells us that the priors for the slope coefficient for the variable MAD was “flat”. Per default, brms uses so-called improper priors for slope coefficients, i.e., not specifying any prior at all, so that every parameter value is equally weighted (even if this is not a proper probability distribution since the support is infinite).

In contrast, brms /does/ use more specific, in fact rather smart, priors for the intercept and for the standard deviation. These priors are informed by the data. Look:

We can change the priors used to fit the model with the prior attribute and the brms::prior() function. Here, we set it to a normal (with ridiculously small standard deviation).

The brms::prior() function expects the prior to be specified as a Stan expression. Full documentation of the available functions for priors is in the Stan Functions Reference .

Fit a third model model3 as the previous ones, but set the prior for the slope coefficient to a Student’s \(t\) distribution with mean 0, standard deviation 100 and one degree of freedom.

Compare the mean posteriors for all three main parameters (intercept, slope for MAD and sigma) in all three models. What effect did changing the prior on the slope parameter have for models 2 and 3? Remember that the priors for these models are quite “unreasonable” in the sense that they are far away from the posterior obtained for model 1.

We see that the Student-t prior in model 3 gives a very similar fit as for model 1. This is likely due to the heavier tails of the Student-t distribution.

We also see that the more restricted model 2 has a much lower mean posterior for the slope coefficient (because this parameter is “leashed close to zero” by the prior). Instead, model 2 compensates with a much higher intercept estimate.

The important upshot of this exercise is that since all parameters jointly condition the likelihood function, it can happen that changing the priors for just one parameter will also affect the posterior inferences for other parameters (who have to “go out of their way” to compensate for what the other parameter can or cannot do, so to speak).

This raises the important, and controversial question of how to determine “good priors”. A simple, evasive but plainly true answer is: it depends on what you want to do with your model. Things you might want to do with your model include:

  • explore or monkey-around,
  • make serious predictions about the future (e.g., disease spread, market development),
  • or draw theoretical conclusions from data (e.g., which theory of reading-times in garden-path sentences is supported better by some data)).

In almost all cases, however, it is good advice to remember this: priors should be evaluated in the context of the (prior) predictions they entail . In other words, we need to think of priors as having a dual role:

  • priors encode assumptions about likely parameter values, and
  • priors inform the prior predictive distribution, and
  • priors may also bias our posterior estimates.

For complex models, with myriads of (hierarchically nested) parameters it may not be feasible to have, express or defend particular ideas about what plausible sets of parameter values are. In it, however, often much more feasible to have intuitions about the implications of particular choices of priors for the predictions that the model makes, e.g, about the data that would be likely to observe from an a prior point of view. Finally, the choice of priors can matter for parameter estimation. For one, complex models may require choices of adequate priors in order for us to be able to fit them with whatever technique we are using. So, we might consider a form of regularizing prior simply in order to make the model “well-behaved”, while at the same time making sure that we are not sneaking in unwarranted assumptions (back again to checking the prior predictive distribution). Moreover, the choice of priors might be biasing in ways that we did not intend. For these reasons, it may also be prudent to inspect the impact of the choice of priors on the posterior distribution. In the following, we therefore first look at how to take samples from the priors as such, and then how to obtain samples from the prior and posterior distribution.

Sample from prior

Here is how we can obtain samples from the prior distribution over parameters of a model. Sampling from the prior only works if priors are not the improper (flat) default priors. Firstly, we can use the option sample_prior = "only" to obtain only samples from the prior. (NB: we still need to supply the data because it is used for the setting up the model; e.g., specifying the prior for the intercept.)

It is also possible to obtain a posterior fit /and/ prior samples at the same time, but that is a bit more fickle, as the prior samples will have other names, and (AFAICS) other functions are required than for posterior samples, entailing other formatting of the returned samples.

A third possibility is to use stats::update() to draw additional prior samples from an already fitted object, like so:

Run a linear regression model on R’s cars data set, setting the priors exactly as we did for the WebPPL model a previous tutorial in this chapter, i.e., a normal prior (mean -18, standard deviation 5) for the intercept, a normal prior (mean 0, standard 10) for the slope, and a gamma prior (parameters 5 and 5) for the standard deviation. Then rerun the model to obtain only samples from the priors (using stats::update ). Plot the fits both with bayesplot::mcmc_dens .

characteristics of hypothesis in brm

Samples from the predictive distributions

For a simple Bayesian linear regression model, there are different predictive functions we need to distinguish. First, we distinguish the prior and the posterior perspective, of course. Second, we may interested in different kinds or locations of predictions. Concretely, here we look at predictions of central tendency (i.e., predictions of best fitting linear regression lines) and predictions of observations (i.e., data points around the predicted linear regression lines). Notice that the latter also takes the standard deviation into account, the former does not.

Let’s look at a data set of average world temperatures from the middle of the 18th century to the early 21st century. Here is a plot of our data points.

characteristics of hypothesis in brm

To obtain samples from predictive distributions from brms , we first need a model fit. So, let’s obtain a posterior fit and then use sample_prior = "only" to get a similarly structured model fit object with samples from the prior.

Samples of linear predictor values

We can obtain samples for values of the linear predictor using the function tidybayes::add_linpred_draws . Below we supply:

  • the fitted object (posterior or prior),
  • the \(x\) -values for which we want the predictions (here we use the \(x\) -values in the data set),
  • a number of samples
  • the name of the column that contains the samples in the returned data frame.

Similarly, for samples from the prior predictive:

The following plot shows each sample from the linear predictors as a thin grey line, and the actual data as red points. We see that the prior model makes a very wide range of predictions for regression lines, arguably quite implausibly so. The shape of the data points is hardly recognizable for the case of the prior because of that (see also the values on the \(y\) -axes). The second plot, for the posterior, shows that a posteriori credible regression lines are much more confined.

characteristics of hypothesis in brm

Let’s use prior / posterior predictives to investigate the plausibility of different priors for our regression model. We can use the following function plot_predictPriPost to conveniently visualize the predictives for different priors:

  • Inspect, interpret and evaluate the impact of the following “opinionated prior”.

characteristics of hypothesis in brm

These priors put stronger constraints on a priori plausible regression lines than the previous. This is still rather unconstrained, though, so it has little (visual) effect on the posterior.

  • Inspect, interpret and evaluate the impact of the following “crazy prior”.

characteristics of hypothesis in brm

These priors are very strong and make a priori plausible regression lines that predict a strongly decreasing world temperature over time. These priors are so strong that the data is not able to overturn the strong bias in these priors. This is a case where inspecting the posterior to evaluate the prior might also be useful. While this case is simple and transparent, sometimes the strength of prior assumptions shows clearly in the posterior fit. However, strongly biased priors may be used for a reason (e.g., the actually test this assumption, e.g., with Bayesian \(p\) -values).

Samples from the data-predictive distribution

We can also obtain samples of predicted hypothetical data \(y\) for a given set of values \(x\) , either from the prior or posterior perspective. We use the function tidybayes::add_predicted_draws for that, like so:

This gives us one sample of a hypothetical average temperature for each year ( \(x\) -value) in the data set. Here is a visualization, where the bigger blue points are predicted values (from the posterior), and the red points are the actually observed data points.

characteristics of hypothesis in brm

Below is another convenience function plot_predictPriPost_data to visualize the data-predictives for different priors.

Inspect, interpret and evaluate the impact of the following piors.

characteristics of hypothesis in brm

  • Privacy Policy

Buy Me a Coffee

Research Method

Home » What is a Hypothesis – Types, Examples and Writing Guide

What is a Hypothesis – Types, Examples and Writing Guide

Table of Contents

What is a Hypothesis

Definition:

Hypothesis is an educated guess or proposed explanation for a phenomenon, based on some initial observations or data. It is a tentative statement that can be tested and potentially proven or disproven through further investigation and experimentation.

Hypothesis is often used in scientific research to guide the design of experiments and the collection and analysis of data. It is an essential element of the scientific method, as it allows researchers to make predictions about the outcome of their experiments and to test those predictions to determine their accuracy.

Types of Hypothesis

Types of Hypothesis are as follows:

Research Hypothesis

A research hypothesis is a statement that predicts a relationship between variables. It is usually formulated as a specific statement that can be tested through research, and it is often used in scientific research to guide the design of experiments.

Null Hypothesis

The null hypothesis is a statement that assumes there is no significant difference or relationship between variables. It is often used as a starting point for testing the research hypothesis, and if the results of the study reject the null hypothesis, it suggests that there is a significant difference or relationship between variables.

Alternative Hypothesis

An alternative hypothesis is a statement that assumes there is a significant difference or relationship between variables. It is often used as an alternative to the null hypothesis and is tested against the null hypothesis to determine which statement is more accurate.

Directional Hypothesis

A directional hypothesis is a statement that predicts the direction of the relationship between variables. For example, a researcher might predict that increasing the amount of exercise will result in a decrease in body weight.

Non-directional Hypothesis

A non-directional hypothesis is a statement that predicts the relationship between variables but does not specify the direction. For example, a researcher might predict that there is a relationship between the amount of exercise and body weight, but they do not specify whether increasing or decreasing exercise will affect body weight.

Statistical Hypothesis

A statistical hypothesis is a statement that assumes a particular statistical model or distribution for the data. It is often used in statistical analysis to test the significance of a particular result.

Composite Hypothesis

A composite hypothesis is a statement that assumes more than one condition or outcome. It can be divided into several sub-hypotheses, each of which represents a different possible outcome.

Empirical Hypothesis

An empirical hypothesis is a statement that is based on observed phenomena or data. It is often used in scientific research to develop theories or models that explain the observed phenomena.

Simple Hypothesis

A simple hypothesis is a statement that assumes only one outcome or condition. It is often used in scientific research to test a single variable or factor.

Complex Hypothesis

A complex hypothesis is a statement that assumes multiple outcomes or conditions. It is often used in scientific research to test the effects of multiple variables or factors on a particular outcome.

Applications of Hypothesis

Hypotheses are used in various fields to guide research and make predictions about the outcomes of experiments or observations. Here are some examples of how hypotheses are applied in different fields:

  • Science : In scientific research, hypotheses are used to test the validity of theories and models that explain natural phenomena. For example, a hypothesis might be formulated to test the effects of a particular variable on a natural system, such as the effects of climate change on an ecosystem.
  • Medicine : In medical research, hypotheses are used to test the effectiveness of treatments and therapies for specific conditions. For example, a hypothesis might be formulated to test the effects of a new drug on a particular disease.
  • Psychology : In psychology, hypotheses are used to test theories and models of human behavior and cognition. For example, a hypothesis might be formulated to test the effects of a particular stimulus on the brain or behavior.
  • Sociology : In sociology, hypotheses are used to test theories and models of social phenomena, such as the effects of social structures or institutions on human behavior. For example, a hypothesis might be formulated to test the effects of income inequality on crime rates.
  • Business : In business research, hypotheses are used to test the validity of theories and models that explain business phenomena, such as consumer behavior or market trends. For example, a hypothesis might be formulated to test the effects of a new marketing campaign on consumer buying behavior.
  • Engineering : In engineering, hypotheses are used to test the effectiveness of new technologies or designs. For example, a hypothesis might be formulated to test the efficiency of a new solar panel design.

How to write a Hypothesis

Here are the steps to follow when writing a hypothesis:

Identify the Research Question

The first step is to identify the research question that you want to answer through your study. This question should be clear, specific, and focused. It should be something that can be investigated empirically and that has some relevance or significance in the field.

Conduct a Literature Review

Before writing your hypothesis, it’s essential to conduct a thorough literature review to understand what is already known about the topic. This will help you to identify the research gap and formulate a hypothesis that builds on existing knowledge.

Determine the Variables

The next step is to identify the variables involved in the research question. A variable is any characteristic or factor that can vary or change. There are two types of variables: independent and dependent. The independent variable is the one that is manipulated or changed by the researcher, while the dependent variable is the one that is measured or observed as a result of the independent variable.

Formulate the Hypothesis

Based on the research question and the variables involved, you can now formulate your hypothesis. A hypothesis should be a clear and concise statement that predicts the relationship between the variables. It should be testable through empirical research and based on existing theory or evidence.

Write the Null Hypothesis

The null hypothesis is the opposite of the alternative hypothesis, which is the hypothesis that you are testing. The null hypothesis states that there is no significant difference or relationship between the variables. It is important to write the null hypothesis because it allows you to compare your results with what would be expected by chance.

Refine the Hypothesis

After formulating the hypothesis, it’s important to refine it and make it more precise. This may involve clarifying the variables, specifying the direction of the relationship, or making the hypothesis more testable.

Examples of Hypothesis

Here are a few examples of hypotheses in different fields:

  • Psychology : “Increased exposure to violent video games leads to increased aggressive behavior in adolescents.”
  • Biology : “Higher levels of carbon dioxide in the atmosphere will lead to increased plant growth.”
  • Sociology : “Individuals who grow up in households with higher socioeconomic status will have higher levels of education and income as adults.”
  • Education : “Implementing a new teaching method will result in higher student achievement scores.”
  • Marketing : “Customers who receive a personalized email will be more likely to make a purchase than those who receive a generic email.”
  • Physics : “An increase in temperature will cause an increase in the volume of a gas, assuming all other variables remain constant.”
  • Medicine : “Consuming a diet high in saturated fats will increase the risk of developing heart disease.”

Purpose of Hypothesis

The purpose of a hypothesis is to provide a testable explanation for an observed phenomenon or a prediction of a future outcome based on existing knowledge or theories. A hypothesis is an essential part of the scientific method and helps to guide the research process by providing a clear focus for investigation. It enables scientists to design experiments or studies to gather evidence and data that can support or refute the proposed explanation or prediction.

The formulation of a hypothesis is based on existing knowledge, observations, and theories, and it should be specific, testable, and falsifiable. A specific hypothesis helps to define the research question, which is important in the research process as it guides the selection of an appropriate research design and methodology. Testability of the hypothesis means that it can be proven or disproven through empirical data collection and analysis. Falsifiability means that the hypothesis should be formulated in such a way that it can be proven wrong if it is incorrect.

In addition to guiding the research process, the testing of hypotheses can lead to new discoveries and advancements in scientific knowledge. When a hypothesis is supported by the data, it can be used to develop new theories or models to explain the observed phenomenon. When a hypothesis is not supported by the data, it can help to refine existing theories or prompt the development of new hypotheses to explain the phenomenon.

When to use Hypothesis

Here are some common situations in which hypotheses are used:

  • In scientific research , hypotheses are used to guide the design of experiments and to help researchers make predictions about the outcomes of those experiments.
  • In social science research , hypotheses are used to test theories about human behavior, social relationships, and other phenomena.
  • I n business , hypotheses can be used to guide decisions about marketing, product development, and other areas. For example, a hypothesis might be that a new product will sell well in a particular market, and this hypothesis can be tested through market research.

Characteristics of Hypothesis

Here are some common characteristics of a hypothesis:

  • Testable : A hypothesis must be able to be tested through observation or experimentation. This means that it must be possible to collect data that will either support or refute the hypothesis.
  • Falsifiable : A hypothesis must be able to be proven false if it is not supported by the data. If a hypothesis cannot be falsified, then it is not a scientific hypothesis.
  • Clear and concise : A hypothesis should be stated in a clear and concise manner so that it can be easily understood and tested.
  • Based on existing knowledge : A hypothesis should be based on existing knowledge and research in the field. It should not be based on personal beliefs or opinions.
  • Specific : A hypothesis should be specific in terms of the variables being tested and the predicted outcome. This will help to ensure that the research is focused and well-designed.
  • Tentative: A hypothesis is a tentative statement or assumption that requires further testing and evidence to be confirmed or refuted. It is not a final conclusion or assertion.
  • Relevant : A hypothesis should be relevant to the research question or problem being studied. It should address a gap in knowledge or provide a new perspective on the issue.

Advantages of Hypothesis

Hypotheses have several advantages in scientific research and experimentation:

  • Guides research: A hypothesis provides a clear and specific direction for research. It helps to focus the research question, select appropriate methods and variables, and interpret the results.
  • Predictive powe r: A hypothesis makes predictions about the outcome of research, which can be tested through experimentation. This allows researchers to evaluate the validity of the hypothesis and make new discoveries.
  • Facilitates communication: A hypothesis provides a common language and framework for scientists to communicate with one another about their research. This helps to facilitate the exchange of ideas and promotes collaboration.
  • Efficient use of resources: A hypothesis helps researchers to use their time, resources, and funding efficiently by directing them towards specific research questions and methods that are most likely to yield results.
  • Provides a basis for further research: A hypothesis that is supported by data provides a basis for further research and exploration. It can lead to new hypotheses, theories, and discoveries.
  • Increases objectivity: A hypothesis can help to increase objectivity in research by providing a clear and specific framework for testing and interpreting results. This can reduce bias and increase the reliability of research findings.

Limitations of Hypothesis

Some Limitations of the Hypothesis are as follows:

  • Limited to observable phenomena: Hypotheses are limited to observable phenomena and cannot account for unobservable or intangible factors. This means that some research questions may not be amenable to hypothesis testing.
  • May be inaccurate or incomplete: Hypotheses are based on existing knowledge and research, which may be incomplete or inaccurate. This can lead to flawed hypotheses and erroneous conclusions.
  • May be biased: Hypotheses may be biased by the researcher’s own beliefs, values, or assumptions. This can lead to selective interpretation of data and a lack of objectivity in research.
  • Cannot prove causation: A hypothesis can only show a correlation between variables, but it cannot prove causation. This requires further experimentation and analysis.
  • Limited to specific contexts: Hypotheses are limited to specific contexts and may not be generalizable to other situations or populations. This means that results may not be applicable in other contexts or may require further testing.
  • May be affected by chance : Hypotheses may be affected by chance or random variation, which can obscure or distort the true relationship between variables.

About the author

' src=

Muhammad Hassan

Researcher, Academic Writer, Web developer

You may also like

Data collection

Data Collection – Methods Types and Examples

Delimitations

Delimitations in Research – Types, Examples and...

Research Process

Research Process – Steps, Examples and Tips

Research Design

Research Design – Types, Methods and Examples

Institutional Review Board (IRB)

Institutional Review Board – Application Sample...

Evaluating Research

Evaluating Research – Process, Examples and...

Research Graduate

The Best PhD and Masters Consulting Company

Characteristics Of A Good Hypothesis

Characteristics Of A Good Hypothesis​

What exactly is a hypothesis.

A hypothesis is a conclusion reached after considering the evidence. This is the first step in any investigation, where the research questions are translated into a prediction. Variables, population, and the relationship between the variables are all included. A research hypothesis is a hypothesis that is tested to see if two or more variables have a relationship. Now let’s have a look at the characteristics of a  good hypothesis.

 Characteristics of

A good hypothesis has the following characteristics.

 Ability To Predict

Closest to things that can be seen, testability, relevant to the issue, techniques that are applicable, new discoveries have been made as a result of this ., harmony & consistency.

  • The similarity between the two phenomena.
  • Observations from previous studies, current experiences, and feedback from rivals.
  • Theories based on science.
  • People’s thinking processes are influenced by general patterns.
  • A straightforward hypothesis
  • Complex Hypothesis
  • Hypothesis  with a certain direction
  •  Non-direction Hypothesis
  • Null Hypothesis
  • Hypothesis of association and chance

Leave a Comment Cancel Reply

Your email address will not be published. Required fields are marked *

Save my name, email, and website in this browser for the next time I comment.

  • Scientific Methods

What is Hypothesis?

We have heard of many hypotheses which have led to great inventions in science. Assumptions that are made on the basis of some evidence are known as hypotheses. In this article, let us learn in detail about the hypothesis and the type of hypothesis with examples.

A hypothesis is an assumption that is made based on some evidence. This is the initial point of any investigation that translates the research questions into predictions. It includes components like variables, population and the relation between the variables. A research hypothesis is a hypothesis that is used to test the relationship between two or more variables.

Characteristics of Hypothesis

Following are the characteristics of the hypothesis:

  • The hypothesis should be clear and precise to consider it to be reliable.
  • If the hypothesis is a relational hypothesis, then it should be stating the relationship between variables.
  • The hypothesis must be specific and should have scope for conducting more tests.
  • The way of explanation of the hypothesis must be very simple and it should also be understood that the simplicity of the hypothesis is not related to its significance.

Sources of Hypothesis

Following are the sources of hypothesis:

  • The resemblance between the phenomenon.
  • Observations from past studies, present-day experiences and from the competitors.
  • Scientific theories.
  • General patterns that influence the thinking process of people.

Types of Hypothesis

There are six forms of hypothesis and they are:

  • Simple hypothesis
  • Complex hypothesis
  • Directional hypothesis
  • Non-directional hypothesis
  • Null hypothesis
  • Associative and casual hypothesis

Simple Hypothesis

It shows a relationship between one dependent variable and a single independent variable. For example – If you eat more vegetables, you will lose weight faster. Here, eating more vegetables is an independent variable, while losing weight is the dependent variable.

Complex Hypothesis

It shows the relationship between two or more dependent variables and two or more independent variables. Eating more vegetables and fruits leads to weight loss, glowing skin, and reduces the risk of many diseases such as heart disease.

Directional Hypothesis

It shows how a researcher is intellectual and committed to a particular outcome. The relationship between the variables can also predict its nature. For example- children aged four years eating proper food over a five-year period are having higher IQ levels than children not having a proper meal. This shows the effect and direction of the effect.

Non-directional Hypothesis

It is used when there is no theory involved. It is a statement that a relationship exists between two variables, without predicting the exact nature (direction) of the relationship.

Null Hypothesis

It provides a statement which is contrary to the hypothesis. It’s a negative statement, and there is no relationship between independent and dependent variables. The symbol is denoted by “H O ”.

Associative and Causal Hypothesis

Associative hypothesis occurs when there is a change in one variable resulting in a change in the other variable. Whereas, the causal hypothesis proposes a cause and effect interaction between two or more variables.

Examples of Hypothesis

Following are the examples of hypotheses based on their types:

  • Consumption of sugary drinks every day leads to obesity is an example of a simple hypothesis.
  • All lilies have the same number of petals is an example of a null hypothesis.
  • If a person gets 7 hours of sleep, then he will feel less fatigue than if he sleeps less. It is an example of a directional hypothesis.

Functions of Hypothesis

Following are the functions performed by the hypothesis:

  • Hypothesis helps in making an observation and experiments possible.
  • It becomes the start point for the investigation.
  • Hypothesis helps in verifying the observations.
  • It helps in directing the inquiries in the right direction.

How will Hypothesis help in the Scientific Method?

Researchers use hypotheses to put down their thoughts directing how the experiment would take place. Following are the steps that are involved in the scientific method:

  • Formation of question
  • Doing background research
  • Creation of hypothesis
  • Designing an experiment
  • Collection of data
  • Result analysis
  • Summarizing the experiment
  • Communicating the results

Frequently Asked Questions – FAQs

What is hypothesis.

A hypothesis is an assumption made based on some evidence.

Give an example of simple hypothesis?

What are the types of hypothesis.

Types of hypothesis are:

  • Associative and Casual hypothesis

State true or false: Hypothesis is the initial point of any investigation that translates the research questions into a prediction.

Define complex hypothesis..

A complex hypothesis shows the relationship between two or more dependent variables and two or more independent variables.

Quiz Image

Put your understanding of this concept to test by answering a few MCQs. Click ‘Start Quiz’ to begin!

Select the correct answer and click on the “Finish” button Check your score and answers at the end of the quiz

Visit BYJU’S for all Physics related queries and study materials

Your result is as below

Request OTP on Voice Call

Leave a Comment Cancel reply

Your Mobile number and Email id will not be published. Required fields are marked *

Post My Comment

characteristics of hypothesis in brm

  • Share Share

Register with BYJU'S & Download Free PDFs

Register with byju's & watch live videos.

close

IMAGES

  1. BRM Types of Hypothesis Day 5 part 2 07/07/2020

    characteristics of hypothesis in brm

  2. 2.8 Hypothesis in Research (BRM) Frame hypothesis statements

    characteristics of hypothesis in brm

  3. Brm lecture9 hypothesis testing

    characteristics of hypothesis in brm

  4. 🏷️ Formulation of hypothesis in research. How to Write a Strong

    characteristics of hypothesis in brm

  5. PPT

    characteristics of hypothesis in brm

  6. Brm lecture9 hypothesis testing

    characteristics of hypothesis in brm

VIDEO

  1. Long Lifers

  2. Difference B/w Sex, Gender, Sexual Orientation & Gender Expression

  3. Did You Know the Difference Between Frogs and Toads

  4. Hypothesis Testing

  5. The Good Genes Hypothesis

  6. Formulation of hypothesis |Biological method

COMMENTS

  1. hypothesis function

    An R object typically of class brmsfit. hypothesis. A character vector specifying one or more non-linear hypothesis concerning parameters of the model. class. A string specifying the class of parameters being tested. Default is "b" for fixed effects. Other typical options are "sd" or "cor".

  2. Chapter 12 Bayesian estimation with brms

    12.1.1.1 Brms family. The family argument in brms::brm() is used to define the random part of the model. The brms package extends the options of the family argument in the glm() function to allow for a much wider class of likelihoods. You can see the help file (help("brmsfamily", package="brms")) for a full list of the current options.Some examples of available options are:

  3. Building a Multilevel Model in BRMS Tutorial: Popularity Data

    The popularity dataset contains characteristics of pupils in different classes. The main goal of this tutorial is to find models and test hypotheses about the relation between these characteristics and the popularity of pupils (according to their classmates). ... If we look at the different inputs for the brm() function we: have "popular ...

  4. A An introduction to Bayesian multilevel models using brms

    A.1.1 Bayesian data analysis. The Bayesian approach to data analysis differs from the frequentist one in that each parameter of the model is considered as a random variable (contrary to the frequentist approach which considers parameter values as unknown and fixed quantities), and by the explicit use of probability to model the uncertainty (Gelman et al., 2013).

  5. Estimating Multivariate Models with brms • brms

    Basic Multivariate Models. We begin with a relatively simple multivariate normal model. bform1 <- bf ( mvbind (tarsus, back) ~ sex + hatchdate + (1|p|fosternest) + (1|q|dam)) + set_rescor (TRUE) fit1 <- brm (bform1, data = BTdata, chains = 2, cores = 2) As can be seen in the model code, we have used mvbind notation to tell brms that both tarsus ...

  6. How to calculate contrasts from a fitted brms model

    These transformations are often called "contrasts" or "general linear hypothesis tests". But really, they are just transformations of the joint posterior distribution of the model's parameters. To answer Q2, then, we encode our question into a combination of the models parameters: q2 <- c(q2 = "time + time:treatment1 = 0") The slope ...

  7. An Introduction to brms

    Generate data. First, we'll generate two independent normally distributed samples. These will correspond to two levels of a grouping variable, so let's call them group A and group B. Group A will have a mean μA = 20 and a standard deviation σA = 2, whereas group B have have the parameters μB = 16 and σB = 1.5.

  8. Influence of Priors: Popularity Data

    Step 2: Downloading the data. The popularity dataset contains characteristics of pupils in different classes. The main goal of this tutorial is to find models and test hypotheses about the relation between these characteristics and the popularity of pupils (according to their classmates).

  9. R Linear Regression Bayesian (using brms)

    Regression - Default Priors. In this exercise you will investigate the impact of Ph.D. students' \(age\) and \(age^2\) on the delay in their project time, which serves as the outcome variable using a regression analysis (note that we ignore assumption checking!). As you know, Bayesian inference consists of combining a prior distribution with the likelihood obtained from the data.

  10. Estimating Distributional Models with brms • brms

    Fortunately, brms uses Stan on the backend, which is an incredibly flexible and powerful tool for estimating Bayesian models so that model complexity is much less of an issue. Suppose we have a normally distributed response variable. Then, in basic linear regression, we specify a predictor term ημ η μ for the mean parameter μ μ of the ...

  11. hypothesis.brmsfit : Non-Linear Hypothesis Testing

    x: An R object. If it is no brmsfit object, it must be coercible to a data.frame.In the latter case, the variables used in the hypothesis argument need to correspond to column names of x, while the rows are treated as representing posterior draws of the variables.. hypothesis: A character vector specifying one or more non-linear hypothesis concerning parameters of the model.

  12. Matti's homepage

    Recently, there's been a lot of talk about meta-analysis, and here I would just like to quickly show that Bayesian multilevel modeling nicely takes care of your meta-analysis needs, and that it is easy to do in R with the rstan and brms packages. As you'll see, meta-analysis is a special case of Bayesian multilevel modeling when you are ...

  13. Chapter 2 Binomial Modeling

    2.4 Prior on the logit parameter. Since we will writing a model in terms of the logit function θ = log( p 1 −p) θ = log ( p 1 − p) We want to find a corresponding normal prior on θ θ. A simple way of doing this is by simulation …. Simulate 1000 draws from the beta prior on p p. Compute θ θ on these simulated draws of p p.

  14. brms source: R/hypothesis.R

    R/hypothesis.R defines the following functions: plot.brmshypothesis print.brmshypothesis round_numeric evidence_ratio density_ratio find_vars eval_hypothesis combine_hlist hypothesis_coef .hypothesis hypothesis.default hypothesis hypothesis.brmsfit

  15. How to properly compare interacting levels

    I am new to brms, trying to figure out its behaviour in details. I was trying to run essentially an anova-like model with random effects (anova-like in the sense that all IVs are factors). For example, brm(DV ~ A * C + B * C + (1|item) + (1|participant), ... conditional_effects() and hypothesis() are truly amazing analytic tools. However, when you deal with factors and, in particular, their ...

  16. Bayesian Multilevel Modeling with brms

    2) Multilevel regression model syntax! Here is the general syntax for modeling in two popular packages, lme4 and brms. In general, this syntax looks very similar to the lm () syntax in R. In multilevel regression models, we can let different groups (lets say subjects here) have their own intercepts or slopes or both.

  17. Bayesian Regression: Theory & Practice

    This tutorial covers how to inspect, set and sample priors in Bayesian regression models with brms. We also look at how to sample from the prior and posterior distribution. The main conceptual take-home message of this tutorial is: The choice of prior should be informed by their effect on the prior (and possibly also the posterior) predictive ...

  18. What is a Hypothesis

    Definition: Hypothesis is an educated guess or proposed explanation for a phenomenon, based on some initial observations or data. It is a tentative statement that can be tested and potentially proven or disproven through further investigation and experimentation. Hypothesis is often used in scientific research to guide the design of experiments ...

  19. Testing Hypotheses

    Up to now the focus has been on point estimation, interpretation of estimators and model validation. In this chapter, the important concept of hypothesis testing is considered and various tests for the BRM are derived. It is interesting to note that in order to achieve some of the results for the BRM, knowledge about the \(EBRM_B^2\) and the \(EBRM_W^2\) is useful.

  20. Characteristics Of A Good Hypothesis

    A good hypothesis has the following characteristics. Ability To Predict One of the most valuable qualities of a good hypothesis is the ability to anticipate the future. It not only clarifies the current problematic scenario, but also predicts what will happen in the future. As a result of the predictive capacity, hypothesis is the finest ...

  21. What is Hypothesis

    Functions of Hypothesis. Following are the functions performed by the hypothesis: Hypothesis helps in making an observation and experiments possible. It becomes the start point for the investigation. Hypothesis helps in verifying the observations. It helps in directing the inquiries in the right direction.