Introduction

The htobit package fits tobit regression models with conditional heteroscedasticy using maximum likelihood estimation. The model assumes an underlying latent Gaussian variable

\[y_i^* \sim \mathcal{N}(\mu_i, \sigma_i^2)\]

which is only observed if positive and zero otherwise: \(y_i = \max(0, y_i^*)\). The latent mean \(\mu_i\) and scale \(\sigma_i\) (latent standard deviation) are linked to two different linear predictors \[ \begin{aligned} \mu_i & = x_i^\top \beta \\ \log(\sigma_i) & = z_i^\top \gamma \end{aligned} \] where the regressor vectors \(x_i\) and \(z_i\) can be set up without restrictions, i.e., they can be identical, overlapping or completely different or just including an intercept, etc.

See also Messner, Mayr, and Zeileis (2016) for a more detailed introduction to this model class as well as a better implementation in the package crch. The main purpose of htobit is to illustrate how to create such a package from scratch.


Implementation

As usual in many other regression packages for R (R Core Team 2019), the main model fitting function htobit() uses a formula-based interface and returns an (S3) object of class htobit:

htobit(formula, data, subset, na.action,
  model = TRUE, y = TRUE, x = FALSE,
  control = htobit_control(...), ...)

Actually, the formula can be a two-part Formula (Zeileis and Croissant 2010), specifying separate sets of regressors \(x_i\) and \(z_i\) for the location and scale submodels, respectively.

The underlying workhorse function is htobit_fit() which has a matrix interface and returns an unclassed list.

A number of standard S3 methods are provided:

Method Description
print() Simple printed display with coefficients
summary() Standard regression summary; returns summary.htobit object (with print() method)
coef() Extract coefficients
vcov() Associated covariance matrix
predict() (Different types of) predictions for new data
fitted() Fitted values for observed data
residuals() Extract (different types of) residuals
terms() Extract terms
model.matrix() Extract model matrix (or matrices)
nobs() Extract number of observations
logLik() Extract fitted log-likelihood
bread() Extract bread for sandwich covariance
estfun() Extract estimating functions (= gradient contributions) for sandwich covariances
getSummary() Extract summary statistics for mtable()

Due to these methods a number of useful utilities work automatically, e.g., AIC(), BIC(), coeftest() (lmtest), lrtest() (lmtest), waldtest() (lmtest), linearHypothesis() (car), mtable() (memisc), Boot() (car), etc.


Illustration

To illustrate the package’s use in practice, a comparison of several homoscedastic and heteroscedastic tobit regression models is applied to data on budget shares of alcohol and tobacco for 2724 Belgian households (taken from Verbeek 2004). The homoscedastic model from Verbeek (2004) can be replicated by:

data("AlcoholTobacco", package = "htobit")
library("htobit")
ma <- htobit(alcohol ~ (age + adults) * log(expenditure) + oldkids + youngkids, data = AlcoholTobacco)
summary(ma)
## 
## Call:
## htobit(formula = alcohol ~ (age + adults) * log(expenditure) + oldkids + 
##     youngkids, data = AlcoholTobacco)
## 
## Standardized residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0698 -0.4407 -0.1364  0.3934  8.3170 
## 
## Coefficients (location model):
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -0.1591533  0.0437782  -3.635 0.000278 ***
## age                      0.0134938  0.0108824   1.240 0.214989    
## adults                   0.0291901  0.0169469   1.722 0.084989 .  
## log(expenditure)         0.0126679  0.0032156   3.939 8.17e-05 ***
## oldkids                 -0.0026408  0.0006049  -4.366 1.27e-05 ***
## youngkids               -0.0038789  0.0023835  -1.627 0.103651    
## age:log(expenditure)    -0.0008093  0.0008006  -1.011 0.312067    
## adults:log(expenditure) -0.0022484  0.0012232  -1.838 0.066051 .  
## 
## Coefficients (scale model with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.71236    0.01533  -242.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Log-likelihood:  4755 on 9 Df
## Number of iterations in BFGS optimization: 16

This model is now modified in two directions: First, the variables influencing the location parameter are also employed in the scale submodel. Second, because the various coefficients for different numbers of persons in the household do not appear to be very different, a restricted specification for this is used. Using a Wald test for testing linear hypotheses from car (Fox and Weisberg 2019) yields

library("car")
linearHypothesis(ma, "oldkids = youngkids")
## Linear hypothesis test
## 
## Hypothesis:
## oldkids - youngkids = 0
## 
## Model 1: restricted model
## Model 2: alcohol ~ (age + adults) * log(expenditure) + oldkids + youngkids
## 
##   Df  Chisq Pr(>Chisq)
## 1                     
## 2  1 0.2639     0.6075
linearHypothesis(ma, "oldkids = adults")
## Linear hypothesis test
## 
## Hypothesis:
## - adults  + oldkids = 0
## 
## Model 1: restricted model
## Model 2: alcohol ~ (age + adults) * log(expenditure) + oldkids + youngkids
## 
##   Df  Chisq Pr(>Chisq)  
## 1                       
## 2  1 3.4994    0.06139 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Therefore, the following models are considered:

AlcoholTobacco$persons <- with(AlcoholTobacco, adults + oldkids + youngkids)
ma2 <- htobit(alcohol ~ (age + adults) * log(expenditure) + oldkids + youngkids |
  (age + adults) * log(expenditure) + oldkids + youngkids, data = AlcoholTobacco)
ma3 <- htobit(alcohol ~ age + log(expenditure) + persons | age + log(expenditure) + persons, data = AlcoholTobacco)
BIC(ma, ma2, ma3)
##     df       BIC
## ma   9 -9439.553
## ma2 16 -9735.109
## ma3  8 -9777.154

The BIC would choose the most parsimonious model but a likelihood ratio test would prefer the unconstrained person coefficients:

library("lmtest")
lrtest(ma, ma2, ma3)
## Likelihood ratio test
## 
## Model 1: alcohol ~ (age + adults) * log(expenditure) + oldkids + youngkids
## Model 2: alcohol ~ (age + adults) * log(expenditure) + oldkids + youngkids | 
##     (age + adults) * log(expenditure) + oldkids + youngkids
## Model 3: alcohol ~ age + log(expenditure) + persons | age + log(expenditure) + 
##     persons
##   #Df LogLik Df   Chisq Pr(>Chisq)    
## 1   9 4755.4                          
## 2  16 4930.8  7 350.925  < 2.2e-16 ***
## 3   8 4920.2 -8  21.234   0.006551 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Replication

To assess the reliability of the htobit() implementation, it is benchmarked against the crch() function of Messner, Mayr, and Zeileis (2016), using the same restricted model as above.

library("crch")
ca3 <- crch(alcohol ~ age + log(expenditure) + persons | age + log(expenditure) + persons,
  data = AlcoholTobacco, left = 0)

Using a model table from memisc (Elff 2019) it can be easily seen the results can be replicated using both packages:

library("memisc")
mtable("htobit" = ma3, "crch" = ca3)
## 
## Calls:
## htobit: htobit(formula = alcohol ~ age + log(expenditure) + persons | 
##     age + log(expenditure) + persons, data = AlcoholTobacco)
## crch: crch(formula = alcohol ~ age + log(expenditure) + persons | age + 
##     log(expenditure) + persons, data = AlcoholTobacco, left = 0)
## 
## ========================================================================
##                                 htobit                   crch           
##                        ----------------------- -----------------------   
##                           location     scale      location     scale    
## ------------------------------------------------------------------------
##   (Intercept)              -0.072***  0.176        -0.072***  0.176     
##                            (0.014)   (0.515)       (0.014)   (0.515)    
##   age                       0.002***  0.064***      0.002***  0.064***  
##                            (0.000)   (0.013)       (0.000)   (0.013)    
##   log(expenditure)          0.006*** -0.278***      0.006*** -0.278***  
##                            (0.001)   (0.038)       (0.001)   (0.038)    
##   persons                  -0.002*** -0.111***     -0.002*** -0.111***  
##                            (0.000)   (0.014)       (0.000)   (0.014)    
## ------------------------------------------------------------------------
##   Aldrich-Nelson R-sq.                                                  
##   McFadden R-sq.                                                        
##   Cox-Snell R-sq.                                                       
##   Nagelkerke R-sq.                                                      
##   Likelihood-ratio                                                      
##   p                                                                     
##   Log-likelihood         4920.217                4920.217               
##   Deviance                                                              
##   AIC                   -9824.433               -9824.433               
##   BIC                   -9777.154               -9777.154               
##   N                      2724                    2724                   
## ========================================================================

References

Elff, Martin. 2019. memisc: Tools for Management of Survey Data and the Presentation of Analysis Results. https://CRAN.R-project.org/package=memisc.

Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. 3rd ed. Thousand Oaks: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.

Messner, Jakob W., Georg J. Mayr, and Achim Zeileis. 2016. “Heteroscedastic Censored and Truncated Regression with crch.” The R Journal 8 (1): 173–81. https://doi.org/10.32614/RJ-2016-012.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Verbeek, M. 2004. A Guide to Modern Econometrics. 2nd ed. Chichester: John Wiley & Sons.

Zeileis, Achim, and Yves Croissant. 2010. “Extended Model Formulas in R: Multiple Parts and Multiple Responses.” Journal of Statistical Software 34 (1): 1–13. https://doi.org/10.18637/jss.v034.i01.