The Methodology Center
LCA & LTA FAQ


PROC LCA & PROC LTA

How many indicators can PROC LCA or PROC LTA handle?
The limit on the number of indicators PROC LCA or PROC LTA can handle is 999. However, as you add indicators the size of the contingency table (and often model complexity) increases substantially. Depending on the model and data, it may be possible to include up to 20 or 30 indicators, or even more.  If you get an error message about the estimation engine not being able to fit the saturated model when you have a large number of indicators and/or response categories, see below.
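To make the growth of the contingency table concrete, here is a minimal sketch that counts cells as the product of the number of response categories per indicator. The function name is hypothetical and the code is only an illustration of the arithmetic, not part of PROC LCA or PROC LTA.

```python
import math

def contingency_table_cells(levels_per_indicator):
    """Number of cells in the full cross-classification of the indicators."""
    return math.prod(levels_per_indicator)

# With binary indicators, the table doubles with every indicator added.
for n_indicators in (5, 10, 20, 30):
    cells = contingency_table_cells([2] * n_indicators)
    print(n_indicators, "binary indicators ->", cells, "cells")
```

With 20 binary indicators the table already has more than a million cells, which is why sparseness becomes a practical concern long before the formal limit of 999 indicators.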

How many measurement occasions can PROC LTA handle?
The limit on the number of measurement occasions that PROC LTA can handle is 99. However, practical limitations driven by the amount of memory on the computer and sparseness of the raw data contingency table exist. For these practical reasons, typical users specify five or fewer occasions of measurement.

How are the AIC and BIC in the output computed?
The AIC and BIC are penalized log-likelihood (LL) information criteria. The penalty for the AIC is two times the number of estimated parameters, and the penalty for the BIC is log(N) times the number of estimated parameters. In PROC LCA, the G2 statistic is penalized. The G2 equals -2(LL overall model – LL saturated model). Because the LL for the saturated model is a constant for a given data set, basing the AIC and BIC on this statistic is equivalent to basing it on -2LL of the overall model. Because the AIC and BIC are based on the G2 statistic, they will not necessarily match those from other software packages if they are based on the LL.
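The following sketch illustrates the penalties described above and why basing the criteria on G2 rather than on -2LL does not change model comparisons. All function names and numeric values are hypothetical and chosen only for illustration.

```python
import math

def g2_from_loglik(ll_model, ll_saturated):
    """G2 = -2 * (LL of the model - LL of the saturated model)."""
    return -2.0 * (ll_model - ll_saturated)

def aic(fit_stat, n_params):
    """Fit statistic plus 2 times the number of estimated parameters."""
    return fit_stat + 2 * n_params

def bic(fit_stat, n_params, n_obs):
    """Fit statistic plus log(N) times the number of estimated parameters."""
    return fit_stat + math.log(n_obs) * n_params

# Hypothetical log-likelihoods for two models fit to the same data set.
ll_saturated = -4000.0
ll_a, params_a = -4100.0, 17
ll_b, params_b = -4080.0, 26

g2_a = g2_from_loglik(ll_a, ll_saturated)
g2_b = g2_from_loglik(ll_b, ll_saturated)

# The difference between models is the same on either scale, because the
# saturated-model LL is a constant that cancels out.
diff_g2_based = aic(g2_a, params_a) - aic(g2_b, params_b)
diff_ll_based = aic(-2.0 * ll_a, params_a) - aic(-2.0 * ll_b, params_b)
print(diff_g2_based, diff_ll_based)
```

The absolute AIC and BIC values differ between the two scales by a constant, which is why they may not match the values reported by other packages even though the ranking of models is the same.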

Does this software produce standard errors and/or confidence intervals?
Unfortunately, the current version of PROC LCA and PROC LTA does not provide standard errors for the parameter estimates. We plan to release a new version of the software that provides standard errors based on Bayesian estimation (these standard errors tend to perform better than those obtained using likelihood-based methods, as parameters are often on or near the boundary of the parameter space). The WinLTA software does include a Bayesian estimation option (data augmentation; DA) that produces standard errors. This approach uses the parameter estimates from the EM procedure as a starting point. In other words, a user must first use the EM procedure to arrive at the final model, and then use DA to obtain final parameter estimates and standard errors. DA also allows the user tremendous flexibility in hypothesis testing, because hypothesis tests can be performed on any linear or nonlinear combination of parameters. For more information on DA in WinLTA, please refer to the manual on data augmentation available on the WinLTA downloads page and Lanza, Collins, Schafer, & Flaherty (2005) listed in the bibliography.

Can this software incorporate sampling weights?

At this time, PROC LCA and PROC LTA cannot incorporate sampling weights. If you think that the design effects are not strong in your study, you might be able to demonstrate this in part by reporting both weighted and unweighted descriptive statistics, and then proceed without the weights.

Can I obtain the predicted probability of membership in each latent class or latent status for each individual?

Yes. Posterior probabilities of class or status membership are available by using the OUTPOST option in PROC LCA and PROC LTA. See the user's guide (available at the Download Center) for details on the syntax. If the goal is to assign individuals to a latent class based on their predicted probabilities and link class membership to outcomes, note that this approach does not incorporate the uncertainty of class membership into the analysis, thus biasing inference.

Can the expected frequency table be output?
Although the OUTPOST option generates posterior probabilities, there is no expected frequency table output. It is not possible to get the frequency distribution of class assignment directly from the PROC LCA output. However, interested users can calculate it easily by saving the posterior probabilities to a SAS data file using the OUTPOST option. Then, in a SAS data step, individuals can be assigned to the class with the highest posterior probability. In fact, there are numerous straightforward calculations one may wish to do using the posterior probabilities, including the calculation of entropy and the odds of correct classification (as described in Nagin, 2005).
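As a rough sketch of these calculations outside of SAS, the Python code below assigns each individual to the class with the highest posterior probability, tabulates the assignments, and computes a normalized entropy. It assumes the posterior probabilities have been exported as one row per individual. The function names are hypothetical, and the entropy formula shown is the normalized version commonly used in the LCA literature, which is an assumption here because the text above does not spell out a formula.

```python
import math
from collections import Counter

def modal_class(posteriors):
    """1-based index of the class with the highest posterior probability."""
    return max(range(len(posteriors)), key=lambda c: posteriors[c]) + 1

def class_frequencies(posterior_rows):
    """Frequency distribution of modal class assignment."""
    counts = Counter(modal_class(row) for row in posterior_rows)
    return dict(sorted(counts.items()))

def normalized_entropy(posterior_rows):
    """1 - sum(-p * log p) / (N * log C); values near 1 mean clear separation."""
    n = len(posterior_rows)
    n_classes = len(posterior_rows[0])
    total = 0.0
    for row in posterior_rows:
        for p in row:
            if p > 0:  # a probability of zero contributes nothing
                total -= p * math.log(p)
    return 1.0 - total / (n * math.log(n_classes))

# Hypothetical posterior probabilities for three individuals, two classes.
rows = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
print(class_frequencies(rows))
print(round(normalized_entropy(rows), 3))
```

Keep in mind that modal assignment ignores classification uncertainty, so these counts are descriptive only.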

How do I calculate the significance test for a covariate by hand?
Significance tests are conducted as follows:

1)  Fit a model with all covariates of interest; save loglikelihood as 'loglik'
2)  For each covariate, fit the model with that covariate removed; save loglikelihood as 'll_beta_test'

(the rest of the steps are done for each covariate)

3)  delta2ll = ( 2 * loglik ) - ( 2 * ll_beta_test )
4)  df = ( number of latent classes - 1 ) * ( number of groups )

OR, for binary regression, df = number of groups

5)  Compute the upper-tail probability of a Chi-square distribution with 'df' degrees of freedom at the value 'delta2ll', that is, one minus the CDF evaluated at 'delta2ll'.  This is the p-value for that covariate.
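The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from PROC LCA: the function names and the log-likelihood values are hypothetical, and the p-value is taken as the upper-tail probability (one minus the CDF) of the chi-square distribution, as is standard for a likelihood-ratio test. The chi-square tail is computed with the standard library only, using the recurrence for the incomplete gamma function, so it is valid for positive integer df.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # exact tail for df = 1
    while a < df / 2.0:                      # step up by 2 df at a time
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def covariate_lr_test(loglik, ll_beta_test, n_classes, n_groups, binary=False):
    """Return (delta2ll, df, p_value) for one covariate."""
    delta2ll = (2 * loglik) - (2 * ll_beta_test)
    df = n_groups if binary else (n_classes - 1) * n_groups
    return delta2ll, df, chi2_sf(delta2ll, df)

# Hypothetical values: 3 latent classes, 1 group.
delta2ll, df, p_value = covariate_lr_test(-2450.0, -2456.5, 3, 1)
print(delta2ll, df, p_value)
```

If SciPy is available, `scipy.stats.chi2.sf(delta2ll, df)` gives the same upper-tail probability.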

Does this software handle missing data?

Yes, PROC LCA and PROC LTA handle missing data on the indicators so that you can make use of all the data you have. In other words, it is not necessary to delete cases that have partial data. You just need to code all missing data as SAS system missing (.). Missing data are handled with a full-information maximum likelihood (FIML) technique. Keep in mind that this procedure assumes that data are missing at random (MAR). However, even when the MAR assumption is not met, this missing data procedure performs better than casewise deletion. Note, however, that cases missing values on one or more covariates or on the grouping variable are removed from the analysis.


Can PROC LCA and PROC LTA be installed on the UNIX/LINUX version of SAS?
Not at this time. PROC LCA and PROC LTA are only available for SAS V9 for Windows.

Can I use PROC LCA or PROC LTA to do latent trajectories analysis?
No. Latent trajectories analysis is a different procedure that requires software such as Mplus or SAS Proc Traj.

I got an error message when I tried to run PROC LCA or PROC LTA. What does it mean?
You may have gotten one of several common error messages. If you do not see your error message below, email the Methodology Center for help with troubleshooting.

Error message:

'LCA' is not recognized as an internal or external command operable program or batch file.
In addition, the window opened C:\WINNT\system32\cmd.exe

This problem is most likely due to a failed installation of one of the two files you need to run PROC LCA. To fix it, redownload the PROC LCA installation file from the Methodology Center's website (http://methodology.psu.edu/index.php/downloads/proclcalta) to your desktop. Then, open the *.zip file and rerun the setup.msi program. Doing so should restart the installer. When the appropriate window appears, click the "Repair" option. Then, finish the installation as directed by the installer. Once you have reinstalled, rerun your SAS PROC LCA program.

Error message:
ERRORS OCCURRED:
(The following errors were reported by the estimation engine.)
Argument to exp function too large
exp for gamma
OCCURRED IN: lta_mstep_beta in MOD lta_estimation
M-step for BETA failed at iteration ***.
OCCURRED IN: run_lta_em_modelfit in MOD lta_estimation


This is a numerical problem during logistic regression. It occurs when one of the beta parameters (coefficients) becomes very large, which typically means that one of the gamma parameters has gone to zero. Be sure that you have run the model without covariates to see if the number of latent classes is correct. Alternatively, there may be a problem with the way the covariates have been coded, or there may be missingness in the covariates that causes an entire covariate/group combination to drop out.
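To illustrate the mechanism, the sketch below uses a baseline-category logistic model for class membership. This parameterization and the function name are assumptions made for illustration; this is not the estimation engine's code. It shows that as a beta coefficient grows, a class membership probability is pushed toward zero, and that a large enough argument to the exponential function overflows.

```python
import math

def class_probabilities(betas, x):
    """Class membership probabilities for covariate value x.

    betas holds one (intercept, slope) pair per non-reference class;
    the last class is the reference, with linear predictor zero.
    """
    terms = [math.exp(b0 + b1 * x) for b0, b1 in betas]  # may overflow
    denom = 1.0 + sum(terms)
    return [t / denom for t in terms] + [1.0 / denom]

print(class_probabilities([(0.0, 1.0)], 0.0))   # moderate beta
print(class_probabilities([(0.0, 50.0)], 1.0))  # reference class near zero

try:
    class_probabilities([(0.0, 800.0)], 1.0)
except OverflowError as err:
    print("exp argument too large:", err)
```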

Error message:
A computational error message at iteration x.

To explore problems with the specified model, you can estimate the model again, adding a statement to terminate estimation at some iteration smaller than x, for example:

MAXITER x-1

where x-1 is the number of the last successful iteration prior to the error message. This may provide a clue as to parameters that are hitting boundaries (in particular, look for estimates approaching zero). You may wish to produce an optional output file (e.g., using the OUTPARAM option) so that you can examine parameter estimates with greater precision.

Error message:
WARNING: The estimation engine was not able to fit the saturated model
in order to adjust G-squared to account for missing data.
This may be due to having a large number of response items.
The G-squared, AIC and BIC fit statistics will NOT be provided in the output.
The error message is as follows:
There is a zero or negative number
in the nrc or nrs array.
OCCURRED IN: md_preliminaries in MOD lta_estimation
Could not initialize arrays for CAT model.
OCCURRED IN: run_lta_compute_fit_stats in MOD lta_estimation


This means that it is not possible to adjust the G-squared statistic for missing data when you include that many indicators and/or response categories. The 'cat model' (i.e., the saturated model) must be fit to determine the proper G-squared adjustment, so that the final G-squared for the model properly reflects lack of fit. However, with many indicators and/or response categories (i.e., a very large contingency table) it is not possible to estimate the cat model. Therefore, the fit of the model cannot be evaluated because the fit statistic has not been adjusted for the missing data. However, the parameter estimates can be trusted (assuming that the model is identified). One option may be to identify some sets of indicators that could be combined logically without losing much information. That way, you could fit a model almost as large but still adjust for missing data in the fit statistics.


Whom should I contact when I have problems using PROC LCA or PROC LTA?
Send an email to the Methodology Center.

What other software packages are available for LCA or LTA?
Here are a few links to other software packages with features related to LCA:




Overview

What is latent transition analysis?
Latent class theory is a measurement theory based on the idea of a static (i.e. unchanging) latent variable that divides a population into mutually exclusive and exhaustive latent classes. Categorical manifest items serve as indicators of the latent variable. Latent transition analysis (LTA) is a special case of latent class theory where the latent variable is dynamic, i.e. changing in systematic ways over time.

What types of variables do I need for LCA or LTA?
PROC LCA, PROC LTA, and WinLTA require categorical manifest variables to measure categorical latent variables.

What is the difference between latent variables and manifest variables?
Latent variables are unobserved variables that are measured by multiple observed items, also called manifest variables. For instance, to examine substance use onset as a latent variable, multiple observed items measuring behaviors such as alcohol use, cigarette use, marijuana use, etc. may be used as indicators or manifest variables.

Do I need cross-sectional or longitudinal data?
This depends on the kind of model you wish to fit. If you are interested in estimating the number of classes at a single time of measurement, you will want to fit a latent class model, and therefore cross-sectional data are what you need. If you are interested in estimating transitions between stages over time, you will want to fit an LTA model, and therefore longitudinal data are essential.

I want to read more about LCA and LTA. Can you recommend some articles?
You can refer to the bibliography, where introductory and advanced papers are suggested.

How is LCA different from cluster analysis?
For a concise discussion comparing LCA to K-means cluster analysis, see the following article:

Magidson, J., & Vermunt, J. K. (2002). Latent class models for clustering: A comparison with K-means. Canadian Journal of Marketing Research, 20, 37-44.

Available from Statistical Innovations at http://www.statisticalinnovations.com/articles/cjmr.pdf





Fitting Your Model

How big should my sample size be in order to conduct LCA or LTA?
This depends on a number of factors, such as the overall size of the contingency table and how saturated the model is. In our experience, LTA works best with sample sizes of at least 300. Larger samples may be needed for some problems, particularly those where many indicators are involved or the model is complex. Remember that in many cases it is reasonable (and often desirable) to restrict measurement error (i.e., constrain rho parameters to be equal) across times and/or groups.

What should I do if I have many variables and I want to include them all as indicators?
We recommend creating composite variables for use as indicators. This is often a great way to obtain good manifest indicators.

What does it mean that the model has an identification problem?
In order for parameter estimation to proceed properly, there must be enough independent information available from data to produce the parameter estimates. Identification problems tend to occur under the following conditions: a lot of parameters are being estimated; the sample size is small in relation to the maximum possible number of response patterns; or the rho parameters are close to 1/number of response categories rather than closer to zero or one. Often, adding reasonable parameter restrictions in order to reduce the number of parameters being estimated will help to achieve an identified model.

What does a negative G2 value mean?
A model should never have a negative G2. A negative G2 value signals that something is wrong, often with the input data. For example, a negative G2 value may be caused by a mistake in reading in the data, such as inputting data as response pattern proportions rather than integer frequencies.

Why is the number of degrees of freedom in my model negative?
A model should never have negative degrees of freedom. Degrees of freedom (df) are equal to the number of possible cells (k) minus the number of parameters estimated (p) minus one (df=k-p-1). A model will have negative degrees of freedom when the model is trying to estimate more parameters than it is possible to estimate. If you have negative degrees of freedom, reduce the number of latent classes or latent statuses, or add parameter restrictions to reduce the number of parameters being estimated.
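As a worked example of df = k - p - 1, the sketch below counts parameters for an unrestricted latent class model with no covariates and no groups. The parameter count used here (one fewer gamma parameter than the number of classes, plus one fewer rho parameter than the number of response categories for each indicator in each class) is the standard count for such a model and is an assumption that does not cover restricted models. The function name is hypothetical.

```python
import math

def lca_degrees_of_freedom(levels_per_indicator, n_classes):
    """df = k - p - 1 for an unrestricted LCA with no covariates or groups."""
    k = math.prod(levels_per_indicator)                        # possible cells
    n_gamma = n_classes - 1                                    # class sizes
    n_rho = n_classes * sum(r - 1 for r in levels_per_indicator)
    p = n_gamma + n_rho                                        # parameters
    return k - p - 1

print(lca_degrees_of_freedom([2] * 5, 3))  # five binary items, three classes
print(lca_degrees_of_freedom([2] * 3, 3))  # three binary items, three classes
```

Three binary indicators cannot support three latent classes: the second call returns a negative value, which is the situation described above.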

Do I have to impose equality constraints on measurement error (i.e., rho) parameters across time?
No, you do not have to impose equality constraints. However, it is often a good idea to do so, because this keeps the meaning of the latent statuses the same over time. This corresponds to the idea of factor invariance in factor analysis. Sometimes, however, you may want to explore the latent class structure separately for each time to get a sense of what underlying groups there may be in your population. You also may want to model multiple times together without restricting measurement to be equal over time. In this case, the number of classes generally has to be equal unless you use a flexible structural equation modeling package that allows you to condition class membership at time 2 on class membership at time 1. One thing to keep in mind is that if a latent class does not exist at time 1 but does at time 2, it is okay for the class membership probability for that class to be (essentially) zero at time 1, with people transitioning into it at time 2. This could be substantively interesting. Also keep in mind that if you allow measurement error to vary across time, it is a good idea to run multiple sets of starting values because identification may be difficult with these larger models.

Why do some models with multiple time points take so long to run?
Run times increase exponentially as you add time points, especially when there are missing data and/or when there are many indicator variables with many levels. When there are missing data, run times can get very long for large numbers of indicators and/or levels per indicator. If your indicators have many levels, it may help to re-code your indicators so that they have 2 or 3 levels. Alternatively, if there are not very many individuals with missing data, you could try removing those cases, if it is appropriate.
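A minimal sketch of why time points and indicator levels matter so much: assuming the same indicators are measured at every occasion, the number of possible response patterns is the per-occasion table size raised to the power of the number of occasions. The function name is hypothetical, and the code illustrates only the size of the table, not actual run times.

```python
import math

def lta_response_patterns(levels_per_indicator, n_times):
    """Possible response patterns when the same items are measured n_times."""
    cells_per_time = math.prod(levels_per_indicator)
    return cells_per_time ** n_times

# Six indicators at three time points, before and after recoding to 2 levels.
print(lta_response_patterns([5] * 6, 3))
print(lta_response_patterns([2] * 6, 3))
```

In this example, recoding six five-level indicators to two levels shrinks the table from roughly 3.8 trillion possible patterns to 262,144.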




Selecting Your Model

How do I assess the fit of my model?
PROC LCA and PROC LTA provide the likelihood-ratio chi-square statistic, denoted G2, as well as the AIC and BIC information criteria. Goodness-of-fit, or absolute model fit, can be assessed by comparing the observed response pattern proportions to the response pattern proportions predicted by the model. If the model as estimated is a good representation of the data, then it will predict the response pattern proportions with a high degree of accuracy. A poor model will not be able to reproduce the observed response pattern proportions very well. The G2 statistic expresses the correspondence between the observed and predicted response patterns. For ordinary contingency table models the G2 follows a chi-square distribution; unfortunately, for the large contingency tables common in LTA, the chi-square becomes an inaccurate approximation of the G2 distribution. A very rough rule of thumb is that a good model has a goodness-of-fit statistic (G2 value) lower than the degrees of freedom. Relative model fit (that is, deciding which of several models is optimal in terms of balancing fit and parsimony) can be assessed with the AIC and BIC. Models with lower AIC and BIC are preferred.
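For readers who want to see how observed and predicted response patterns are compared, here is a minimal sketch of the usual likelihood-ratio formula, G2 = 2 * sum(observed * ln(observed / expected)), together with the rough rule of thumb above. It assumes complete data; the missing-data adjustment that the software applies is not shown. Function names are hypothetical.

```python
import math

def g_squared(observed_freqs, expected_freqs):
    """Likelihood-ratio statistic over response patterns (complete data)."""
    g2 = 0.0
    for f_obs, f_exp in zip(observed_freqs, expected_freqs):
        if f_obs > 0:  # patterns that were never observed contribute zero
            g2 += 2.0 * f_obs * math.log(f_obs / f_exp)
    return g2

def passes_rough_rule(g2, df):
    """Very rough rule of thumb: G2 lower than the degrees of freedom."""
    return g2 < df

print(g_squared([10, 20, 30], [10.0, 20.0, 30.0]))  # perfect reproduction
print(g_squared([30, 70], [50.0, 50.0]))            # poor reproduction
```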

What do I do if the AIC and BIC do not agree?

AIC and BIC are both penalized-likelihood criteria. AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. Both criteria are based on various assumptions and asymptotic approximations. Each, despite its heuristic usefulness, has therefore been criticized as having questionable validity for real-world data. But despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily. The only way they should disagree is when AIC chooses a larger model than BIC. In general, it might be best to use AIC and BIC together in model selection. For example, in selecting the number of latent classes in a model, if BIC points to a three-class model and AIC points to a five-class model, it makes sense to select from models with 3, 4 and 5 latent classes. AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.
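The suggestion to consider every model between the BIC choice and the AIC choice can be sketched as follows. The G2 values, parameter counts, and sample size are made up purely for illustration, and the function name is hypothetical.

```python
import math

def candidate_class_counts(g2_by_classes, params_by_classes, n_obs):
    """Class counts between the BIC-preferred and AIC-preferred models."""
    aic = {c: g2 + 2 * params_by_classes[c]
           for c, g2 in g2_by_classes.items()}
    bic = {c: g2 + math.log(n_obs) * params_by_classes[c]
           for c, g2 in g2_by_classes.items()}
    best_aic = min(aic, key=aic.get)
    best_bic = min(bic, key=bic.get)
    low, high = sorted((best_aic, best_bic))
    return [c for c in sorted(g2_by_classes) if low <= c <= high]

# Made-up fit statistics for models with 2 to 6 latent classes.
g2_values = {2: 600.0, 3: 420.0, 4: 395.0, 5: 370.0, 6: 360.0}
n_params = {2: 17, 3: 26, 4: 35, 5: 44, 6: 53}
candidates = candidate_class_counts(g2_values, n_params, n_obs=1000)
print(candidates)
```

With these made-up numbers BIC points to three classes and AIC to five, so the three-, four-, and five-class models would all be examined for interpretability.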




Interpreting Your Model

How do I interpret the measurement error (i.e., rho) parameters? Is the latent variable being measured well by the indicators in the model?
You are probably familiar with another latent variable model, factor analysis. In factor analysis, a manifest variable's loading on a factor represents the relation between the variable and the factor. Because factor loadings are regression coefficients, a factor loading of zero represents no relation between the manifest variable and the factor, whereas larger factor loadings reflect a stronger relation. In other words, all else being equal, when factor loadings are large the latent variable is being measured better. In latent class and latent transition models, rho parameters play the same conceptual role as factor loadings; however, they are NOT regression coefficients, so they are scaled differently and their interpretation is somewhat different. Rho parameters are probabilities. The closer the rho parameters are to zero and one, the more the responses to the manifest variable are determined by the latent class or latent status. The closer the rho parameters are to 1/(number of response alternatives) - this is .5 for binary variables - the weaker the relation between the manifest variable and the latent class/status. In other words, all else being equal, when rho parameters are close to zero and one, the latent variable is being measured better. Another consideration is the overall pattern of rho parameters. Ideally, the pattern of rho parameters clearly identifies latent classes/statuses with distinguishable interpretations. This is similar to the concept of simple structure in factor analysis.





Advanced Questions

Why might I want to use cross-validation?
Cross-validation can be an alternative to traditional goodness-of-fit testing when there are several plausible models to be compared and there are problems associated with the distribution of the G2 fit statistic, such as when the sample size is small. Cross-validation involves splitting a sample into two (or more) subsamples, for example, Sample A and Sample B, and fitting a series of plausible models to each sample. Each model is fitted to Sample A (the calibration sample), the predicted response frequencies for that model are compared to the observed response frequencies in Sample B (the cross-validation sample), and G2 is computed. Then the reverse is done: each model is fitted to Sample B (now the calibration sample), the predicted response frequencies for that model are compared to the observed response frequencies in Sample A (now the cross-validation sample), and another G2 is computed. A model cross-validates well if the G2 is relatively small when the estimated model is applied to a cross-validation sample. When a series of models is tested, the model or models that cross-validate best are considered best-fitting.
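The two-way procedure described above can be sketched as follows, assuming you already have, for a given model, the response pattern proportions predicted from each calibration sample and the observed response pattern frequencies in the other sample. The function names and data layout (dictionaries keyed by response pattern) are hypothetical, and missing data are not handled.

```python
import math

def crossvalidation_g2(predicted_props, observed_freqs):
    """G2 of calibration-sample predictions against the other sample."""
    n = sum(observed_freqs.values())
    g2 = 0.0
    for pattern, f_obs in observed_freqs.items():
        if f_obs > 0:
            f_exp = n * predicted_props[pattern]  # scale to this sample's N
            g2 += 2.0 * f_obs * math.log(f_obs / f_exp)
    return g2

def double_crossvalidation(pred_from_a, obs_b, pred_from_b, obs_a):
    """Return (G2 for A -> B, G2 for B -> A) for one model."""
    return (crossvalidation_g2(pred_from_a, obs_b),
            crossvalidation_g2(pred_from_b, obs_a))

# Hypothetical two-item example with response patterns "11", "12", "21", "22".
pred_from_a = {"11": 0.4, "12": 0.1, "21": 0.1, "22": 0.4}
obs_b = {"11": 45, "12": 8, "21": 12, "22": 35}
pred_from_b = {"11": 0.45, "12": 0.08, "21": 0.12, "22": 0.35}
obs_a = {"11": 40, "12": 10, "21": 10, "22": 40}
print(double_crossvalidation(pred_from_a, obs_b, pred_from_b, obs_a))
```

Repeating this for each candidate model and comparing the resulting G2 values identifies the models that cross-validate best.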

Can violations of the assumption of local independence among manifest variables be assessed?
One straightforward approach is to assign individuals to the latent class in which they have the highest posterior probability of membership (these probabilities can be saved in a SAS data file using the OUTPOST option). Then, relationships among all indicators of your latent class variable can be explored separately for each group of individuals (i.e., for each class).

More sophisticated (and statistically sound) ways to explore local dependence have been proposed. One of these procedures is outlined by Bandeen-Roche, Miglioretti, Zeger, and Rathouz (1997), who multiply impute latent class membership and look for violations of this assumption within each imputation. This procedure requires Bayesian estimation, which will be included in a future release of PROC LCA and PROC LTA. For now, the WinLTA software (available at http://methodology.psu.edu/downloads/winlta) can be used to impute the latent class variable for models with no covariates.

Can I assign individuals to latent classes or latent statuses based on their posterior probabilities?
We do not recommend assigning individuals to latent classes or latent statuses based on their posterior probabilities unless there is no viable alternative. By assigning individuals to latent classes or latent statuses, you introduce error into your results. There are many different types of analyses that can be performed within the latent class modeling framework (e.g., predicting latent class membership) without having to assign individuals to latent classes or latent statuses. When possible, we recommend working within the latent class modeling framework because it incorporates measurement error into the model, which is ignored by class/status assignment. If you are planning to assign individuals based on posterior probabilities, one article of interest might be:

Goodman, L. A. (2007). On the assignment of individuals to latent classes.  Sociological Methodology, 37(1), 1-22. doi: 10.1111/j.1467-9531.2007.00184.x

In this paper, Goodman describes two ways to assign individuals and two criteria that can be used to assess when class assignment is satisfactory and when it is not.  If you assign individuals to classes/statuses, we recommend evaluating the amount of measurement error introduced by doing so.
