PROC LCA & PROC LTA Software
Installation of PROC LCA/LTA
PROC LCA/LTA is an add-on software to SAS V9 for Windows. To install PROC LCA/LTA properly, you should
(1) have SAS 9.x installed on your computer (Windows system),
(2) download and unzip proc_lca_lta_1_3_0_setup.zip,
(3) run the file "proc_lca_lta_1_3_0.msi",
(4) follow the steps until completion,
(5) open SAS, and
(6) run a PROC LCA/LTA example code to test your installation.
If you can run the example code (e.g., example 1 in the users' guide) without an error message, congratulations! Your installation of PROC LCA/LTA is successful. If you have a problem at step (4), please send an email to MChelpdesk@psu.edu describing the error message and giving your operating system and SAS version. If you have a problem at step (6), please read the next item for a solution.
If you get error messages like “ERROR: Procedure LCA not found,” the most likely cause is that the PROC LCA files were put in a place where your version of SAS cannot find them. The following steps have fixed most installation problems.
1. Use your search to see if the following five files are stored anywhere on your computer:
- msvcr71.dll (not present for 64-bit SAS installations)
Assuming that the download of the installer went well and you extracted the installer, these files should be saved somewhere on your PC. NOTE: Your search default may search only a portion of your hard drive. Please alter your search settings to search for items on ALL drives.
2. Use search to find the file SAS.exe
NOTE: SAS.exe is probably located in one of the following folders:
C:\Program Files\SAS\SAS 9.x,
C:\Program Files\SAS\SASFoundation\9.x, or
3. Put the following three files in the folder where the file SAS.exe is located.
- msvcr71.dll (32-bit SAS only)
4. Put the following two files in the \core\sasexe\ subdirectory, which will be in the folder with the file SAS.exe
(the folder you identified in step 2).
NOTE: If you are able to use one of the folders listed in step 2, the corresponding subdirectories will be
C:\Program Files\SAS\SAS 9.x\core\sasexe\,
C:\Program Files\SAS\SASFoundation\9.x\core\sasexe\, or
If the above steps still don’t fix this installation problem, please send an email to MChelpdesk@psu.edu. In your email, please
(1) copy/paste the error message you received,
(2) describe your SAS version (e.g., SAS9.2, 32 bit), and
(3) describe your operating system (e.g., Windows 7 64 bit). For information regarding your operating system, issue the following command in SAS,
%put SAS SYSSCPL=&SYSSCPL., SYSVLONG=&SYSVLONG.;
and send us the output of this command (it appears in the log window).
Not at this time. PROC LCA and PROC LTA are only available for SAS V9 for Windows.
Using PROC LCA/LTA or LCA Stata Plugin
The limit on the number of indicators PROC LCA or PROC LTA can handle is 999. However, as you add indicators the size of the contingency table (and often model complexity) increases substantially. Depending on the model and data, it may be possible to include up to 20 or 30 indicators, or even more. If you get an error message about the estimation engine not being able to fit the saturated model when you have a large number of indicators and/or response categories, see below.
The limit on the number of measurement occasions that PROC LTA can handle is 99. However, practical limitations driven by the amount of memory on the computer and sparseness of the raw data contingency table exist. For these practical reasons, typical users specify five or fewer occasions of measurement.
The AIC and BIC are penalized log-likelihood (LL) information criteria. The penalty for the AIC is two times the number of estimated parameters, and the penalty for the BIC is log(N) times the number of estimated parameters, where N is the sample size. In PROC LCA, the statistic that is penalized is the G2, which equals -2(LL overall model - LL saturated model). Because the LL for the saturated model is a constant for a given data set, basing the AIC and BIC on the G2 is equivalent, for the purpose of comparing models, to basing them on -2LL of the overall model. However, because the AIC and BIC are based on the G2 statistic, their values will not necessarily match those from other software packages that base them on the LL.
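As a rough illustration of the penalties described above, the following Python sketch computes the AIC and BIC from a G2 value. The function name and the numbers are hypothetical, chosen only for illustration; they are not output from PROC LCA.

```python
import math

def aic_bic_from_g2(g2, n_params, n_obs):
    """Return (AIC, BIC) obtained by penalizing the G2 statistic.

    AIC penalty: 2 * number of estimated parameters.
    BIC penalty: log(N) * number of estimated parameters.
    """
    aic = g2 + 2 * n_params
    bic = g2 + math.log(n_obs) * n_params
    return aic, bic

# Hypothetical values: G2 = 25.0, 17 estimated parameters, N = 1000.
aic, bic = aic_bic_from_g2(g2=25.0, n_params=17, n_obs=1000)
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Because both criteria add their penalty to the same G2, differences in AIC or BIC between two models fitted to the same data do not depend on whether G2 or -2LL is used as the base.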
Yes. Standard errors are now provided where possible in the current version of PROC LCA (but not PROC LTA). They can also be saved to a dataset using the OUTSTDERR option. Given the estimated standard errors of the parameters, one can readily calculate confidence intervals.
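One simple way to turn an estimate and its standard error into an interval is the normal-approximation (Wald) interval, sketched below in Python. This is an assumption about how one might proceed, not a description of anything PROC LCA computes; the estimate and standard error are made up.

```python
def wald_interval(estimate, std_err, z=1.96):
    """Normal-approximation interval: estimate +/- z * standard error.

    z = 1.96 gives an approximate 95% interval.
    """
    return estimate - z * std_err, estimate + z * std_err

# Hypothetical parameter estimate of 0.30 with a standard error of 0.05.
lower, upper = wald_interval(0.30, 0.05)
print(f"Approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```

For parameters that are probabilities, an interval computed this way can extend below zero or above one when the estimate is close to a boundary, so it should be read with care in those cases.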
Yes. Posterior probabilities of class or status membership are available by using the OUTPOST option in PROC LCA and PROC LTA. See the users' guide for details on the syntax. If the goal is to assign individuals to a latent class based on their predicted probabilities and link class membership to outcomes, note that this approach does not incorporate the uncertainty of class membership into the analysis, thus biasing inference.
Although the OUTPOST option generates posterior probabilities, there is no expected frequency table output, so the frequency distribution of class assignment cannot be read directly from the PROC LCA output. However, interested users can calculate it easily by saving the posterior probabilities to a SAS data file using the OUTPOST option. Then, in a SAS data step, individuals can be assigned to the class with the highest posterior probability. In fact, there are numerous straightforward calculations one may wish to do using the posterior probabilities, including the calculation of entropy and the odds of correct classification (as described in Nagin, 2005).
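The calculations described above can be sketched outside SAS as well. The Python example below assumes the posterior probabilities have already been exported to a list of rows (one row per individual, one column per class); the data are invented. The entropy index shown is one commonly used normalized form, which is an assumption here and may differ from the formula in a given reference.

```python
import math
from collections import Counter

# Hypothetical posterior probabilities for four individuals and three classes.
posteriors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
    [0.85, 0.10, 0.05],
]

def modal_class(probs):
    """Return the 1-based index of the class with the highest posterior probability."""
    return max(range(len(probs)), key=lambda c: probs[c]) + 1

def relative_entropy(rows):
    """Normalized entropy index: 1 means perfectly certain classification, 0 means none."""
    n_classes = len(rows[0])
    total = -sum(p * math.log(p) for row in rows for p in row if p > 0)
    return 1 - total / (len(rows) * math.log(n_classes))

assignments = [modal_class(row) for row in posteriors]
class_counts = Counter(assignments)  # frequency distribution of class assignment

print("Assignments:", assignments)
print("Class counts:", dict(class_counts))
print(f"Relative entropy: {relative_entropy(posteriors):.3f}")
```

Keep in mind that analyses based on assigned classes ignore the uncertainty in class membership.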
Significance tests are conducted as follows:
- Fit a model with all covariates of interest; save loglikelihood as 'loglik'
- For each covariate, fit the model with that covariate removed; save loglikelihood as 'll_beta_test' (the rest of the steps are done for each covariate)
- delta2ll = ( 2 * loglik ) - ( 2 * ll_beta_test )
- df = ( number of latent classes - 1 ) * ( number of groups ) OR, for binary regression, df = number of groups
- Compute the CDF of a chi-square distribution with 'df' degrees of freedom at the value 'delta2ll'. One minus this CDF value (the upper-tail probability) is the p-value for that covariate.
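The steps above can be sketched in Python as follows, taking the p-value as the upper-tail probability of the chi-square distribution. The log-likelihood values are hypothetical, and the function names are illustrative only. To stay self-contained, the sketch evaluates the chi-square tail with a series expansion that is adequate for moderate test statistics; a statistics library would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series expansion of the regularized lower incomplete gamma function P(a, x/2).
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def covariate_lrt(loglik, ll_beta_test, n_classes, n_groups=1):
    """Likelihood-ratio test for one covariate (multinomial case)."""
    delta2ll = 2 * loglik - 2 * ll_beta_test
    df = (n_classes - 1) * n_groups
    return delta2ll, df, chi2_sf(delta2ll, df)

# Hypothetical log-likelihoods: full model vs. model with one covariate removed.
delta2ll, df, p_value = covariate_lrt(loglik=-2450.3, ll_beta_test=-2456.1, n_classes=3)
print(f"delta2ll = {delta2ll:.2f}, df = {df}, p = {p_value:.4f}")
```

For binary regression, df would instead be the number of groups, as noted in the list above.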
Version 1.2.6 or higher of PROC LCA can account for sampling weights and clusters. PROC LTA cannot incorporate sampling weights at this time. If you are currently running an older version of PROC LCA and wish to upgrade, you can download the latest version here.
Yes, PROC LCA and PROC LTA handle missing data on the indicators so that you can make use of all the data you have. In other words, it is not necessary to delete cases that have partial data. You just need to code all missing data as SAS system missing (.). Missing data are handled with a full-information maximum likelihood (FIML) technique. Keep in mind that this procedure assumes that data are missing at random (MAR). However, even when the MAR assumption is not met, this missing data procedure performs better than casewise deletion. Note, however, that cases missing values on one or more covariates or on the grouping variable are removed from the analysis.
No. Latent trajectory analysis is a different procedure that requires software such as Mplus or SAS PROC TRAJ.
Typical Error Messages
You may have gotten one of several common error messages. If you do not see the error message below, send an email to MChelpdesk@psu.edu for help with troubleshooting.
'LCA' is not recognized as an internal or external command, operable program or batch file. In addition, the window opened C:\WINNT\system32\cmd.exe
This problem is most likely due to a failed installation of one of the two files you need to run PROC LCA. To fix it, redownload the PROC LCA installation file from the Methodology Center's website to your desktop. Then, open the *.zip file and rerun the setup.msi program. Doing so should restart the installer. When the appropriate window appears, click the "Repair" option. Then, finish the installation as directed by the installer. Once you have reinstalled, rerun your SAS PROC LCA program.
(The following errors were reported by the estimation engine.)
Argument to exp function too large
exp for gamma
OCCURRED IN: lta_mstep_beta in MOD lta_estimation
M-step for BETA failed at iteration ***.
OCCURRED IN: run_lta_em_modelfit in MOD lta_estimation
This is a numerical problem during logistic regression. It occurs when one of the beta parameters (coefficients) starts getting very large. One of the gamma parameters has gone to zero. Be sure that you have run the model without covariates to see if the number of latent classes is correct. Alternatively, there may be a problem with the way the covariates have been coded. There may be missingness in the covariates that causes an entire covariate/group combination to drop out.
A computational error message at iteration x.
To explore problems with the specified model, you can estimate the model again, adding a statement to terminate estimation at some iteration smaller than x, for example: MAXITER x-1 where x-1 is the number of the last successful iteration prior to the error message. This may provide a clue as to parameters that are hitting boundaries (in particular, look for estimates approaching zero). You may wish to produce an optional output file (e.g., using the OUTPARAM option) so that you can examine parameter estimates with greater precision.
Send an email to: MChelpdesk@psu.edu.
Overview of LCA and LTA
Latent class analysis (LCA) is a modeling technique based on the idea that individuals can be divided into subgroups based on an unobservable construct. The construct of interest is the latent variable and the subgroups are called latent classes. True latent class membership is unknown for each individual due to measurement error, but we infer an individual’s membership by measuring the construct with multiple indicators. The indicators are typically categorical; when indicators are continuous we typically refer to it as latent profile analysis (LPA). The latent classes are assumed to be mutually exclusive and exhaustive. Thus, each individual belongs to one and only one latent class, but we are not certain which class due to measurement error.
LCA typically uses cross-sectional data to identify subgroups at a single time point; in this sense we think of class membership as being static. Latent transition analysis (LTA) is an extension of LCA used with longitudinal data where individuals transition between latent classes over time; in this sense we think of class membership as being dynamic and class membership represents a developmental stage. In LTA, development is represented as movement through the stages over time and the technique is particularly well-suited to testing stage-sequential developmental theories (e.g., the transtheoretical model); different individuals may take different paths through the stages.
PROC LCA, PROC LTA, and WinLTA require categorical manifest variables to measure categorical latent variables.
Latent variables are unobserved variables that are measured by multiple observed items, also called manifest variables. For instance, to examine substance use onset as a latent variable, multiple observed items measuring behaviors such as alcohol use, cigarette use, and marijuana use may be used as indicators or manifest variables.
The answer depends on the kind of model you wish to fit. If you are interested in identifying and describing the latent classes at a single measurement occasion, including determining how many there are, you are interested in LCA, and cross-sectional data are all that you need. You can fit LCAs using a variety of software packages, including PROC LCA, Mplus, and Latent Gold. If, instead, you are interested in estimating transitions between latent classes over time, you are interested in LTA, and you need longitudinal data (i.e., two or more measurement occasions). You can fit LTAs using PROC LTA and Mplus.
In a special type of LCA, called a repeated measures LCA, longitudinal data may be used with LCA so that the latent classes represent trajectories across multiple measurement occasions. For more information about this special type of LCA, see Lanza and Collins (2006).
Lanza, S. T., & Collins, L. M. (2006). A mixture model of discontinuous development in heavy drinking from ages 18 to 30: The role of college enrollment. Journal of Studies on Alcohol, 67, 552-561.
You can refer to the reading list, where introductory and advanced papers are suggested.
For a concise discussion comparing LCA to K-means cluster analysis, see the following article:
Magidson, J., & Vermunt, J. K. (2002). Latent class models for clustering: A comparison with K-means. Canadian Journal of Marketing Research, 20, 37-44.
Available from Statistical Innovations at http://www.statisticalinnovations.com/articles/cjmr.pdf
Longitudinal latent class analysis (LLCA) and latent transition analysis (LTA) are two different approaches to modeling change over time in a construct that is discrete, as opposed to continuous. (Very often, continuous change over time is modeled using growth curve analysis, such that the population mean level is estimated as a smooth function of time.) Discrete change may be quantified using a single indicator of the outcome at each assessment time point, or using multiple indicators at each time point. Multiple indicators would be used to measure latent class membership at each time point.
LTA estimates latent class membership at time t+1 conditional on membership at time t; in other words, individuals' probabilities of transitioning from a particular latent class at time t to another latent class at time t+1 are estimated. LTA is a Markov model, estimating transitions from Time 1 to Time 2, Time 2 to Time 3, and so on. This allows one to estimate, for example, the probability of membership in a Heavy Drinking latent class at Time 2 given that one belonged to the Non-User latent class at Time 1.
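To make the role of the transition probabilities concrete, the Python sketch below uses invented numbers for three latent statuses. The status "Moderate Drinking", the variable names, and all probabilities are hypothetical. Each row of the transition matrix gives the probabilities of Time 2 membership conditional on one Time 1 status, so each row sums to one.

```python
statuses = ["Non-User", "Moderate Drinking", "Heavy Drinking"]

# Hypothetical status membership probabilities at Time 1.
delta_t1 = [0.6, 0.3, 0.1]

# Hypothetical transition probabilities: tau[i][j] = P(status j at Time 2 | status i at Time 1).
tau = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
]

def next_prevalences(delta, tau):
    """Status prevalences at the next occasion implied by current prevalences and transitions."""
    n_next = len(tau[0])
    return [sum(delta[i] * tau[i][j] for i in range(len(delta))) for j in range(n_next)]

delta_t2 = next_prevalences(delta_t1, tau)

# Probability of Heavy Drinking at Time 2 given Non-User at Time 1.
p_heavy_given_nonuser = tau[0][2]

for name, prob in zip(statuses, delta_t2):
    print(f"{name}: {prob:.3f}")
```

The conditional probability in the last step is read directly from the transition matrix, while the Time 2 prevalences combine that matrix with the Time 1 prevalences.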
In contrast, LLCA - also referred to as repeated-measures latent class analysis (RMLCA) - is a latent class model where the indicators of the latent class include one or more variables assessed at multiple time points. In concept, this approach is analogous to growth curve modeling in that patterns of responses across all time points are characterized, except that in LLCA change over time is discrete. Lanza & Collins (2006) present an introduction to LLCA where patterns of heavy drinking across six time points are examined, and membership in those developmental patterns is predicted from college enrollment.
The book by Collins & Lanza (2010) describes the differences between LLCA and LTA in greater detail (see Chapter 7).
Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. New York: Wiley.
Lanza, S. T., & Collins, L. M. (2006). A mixture model of discontinuous development in heavy drinking from ages 18 to 30: The role of college enrollment. Journal of Studies on Alcohol, 67, 552-561.
Fitting Your Model
This depends on a number of factors, such as the overall size of the contingency table and how saturated the model is. In our experience, LTA works best with sample sizes of at least 300. Larger samples may be needed for some problems, particularly those where many indicators are involved or the model is complex. Remember that in many cases it is reasonable (and often desirable) to restrict measurement error (i.e., constrain rho parameters to be equal) across times and/or groups.
We recommend creating composite variables for use as indicators. This is often a great way to obtain good manifest indicators.
In order for parameter estimation to proceed properly, there must be enough independent information available from data to produce the parameter estimates. Identification problems tend to occur under the following conditions: a lot of parameters are being estimated; the sample size is small in relation to the maximum possible number of response patterns; or the rho parameters are close to 1/number of response categories rather than closer to zero or one. Often, adding reasonable parameter restrictions in order to reduce the number of parameters being estimated will help to achieve an identified model.
A model should never have a negative G2. A negative G2 value signals that something is wrong, often with the input data. For example, a negative G2 value may be caused by a mistake in reading in the data, such as inputting data as response pattern proportions rather than integer frequencies.
A model should never have negative degrees of freedom. Degrees of freedom (df) are equal to the number of possible cells (k) minus the number of parameters estimated (p) minus one (df=k-p-1). A model will have negative degrees of freedom when the model is trying to estimate more parameters than it is possible to estimate. If you have negative degrees of freedom, reduce the number of latent classes or latent statuses, or add parameter restrictions to reduce the number of parameters being estimated.
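As a worked example of df = k - p - 1, the Python sketch below counts cells and parameters for a basic LCA. It assumes a single group, no covariates, and no parameter restrictions, in which case the parameter count is (number of classes - 1) class membership probabilities plus, for each class, (number of response categories - 1) item-response probabilities per indicator. The function name is illustrative.

```python
import math

def lca_degrees_of_freedom(categories, n_classes):
    """df = k - p - 1 for an unrestricted, single-group LCA without covariates.

    categories: number of response categories for each indicator.
    """
    k = math.prod(categories)  # number of possible response patterns (cells)
    p = (n_classes - 1) + n_classes * sum(r - 1 for r in categories)
    return k - p - 1

# Five binary indicators, three classes: k = 32, p = 2 + 3 * 5 = 17.
print(lca_degrees_of_freedom([2] * 5, n_classes=3))

# Three binary indicators, three classes: k = 8, p = 2 + 3 * 3 = 11, so df is negative.
print(lca_degrees_of_freedom([2] * 3, n_classes=3))
```

The second call illustrates the situation described above: with too few cells for the number of parameters, df falls below zero, and the number of classes must be reduced or restrictions added.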
No, you do not have to impose equality constraints. However, it is often a good idea to do so, because this keeps the meaning of the latent statuses the same over time. This corresponds to the idea of factor invariance in factor analysis. Sometimes, however, you may want to explore the latent class structure separately for each time to get a sense of what underlying groups there may be in your population. You also may want to model multiple times together without restricting measurement to be equal over time. In this case, the number of classes generally has to be equal unless you use a flexible structural equation modeling package that allows you to condition class membership at time 2 on class membership at time 1. One thing to keep in mind is that if a latent class does not exist at time 1 but does at time 2, it is okay for the class membership probability for that class to be (essentially) zero at time 1, with people transitioning into it at time 2. This could be substantively interesting. Also keep in mind that if you allow measurement error to vary across time, it is a good idea to run multiple sets of starting values because identification may be difficult with these larger models.
Run times increase exponentially as you add time points, especially when there are missing data and/or when there are many indicator variables with many levels. When there are missing data, run times can get very long for large numbers of indicators and/or levels per indicator. If your indicators have many levels, it may help to re-code your indicators so that they have 2 or 3 levels. Alternatively, if there are not very many individuals with missing data, you could try removing those cases, if it is appropriate.
Selecting Your Model
PROC LCA and PROC LTA provide the likelihood-ratio chi-square statistic, denoted G2, as well as the AIC and BIC information criteria. Goodness-of-fit, or absolute model fit, can be assessed by comparing the observed response pattern proportions to the response pattern proportions predicted by the model. If the model as estimated is a good representation of the data, then it will predict the response pattern proportions with a high degree of accuracy. A poor model will not be able to reproduce the observed response pattern proportions very well. The G2 statistic expresses the correspondence between the observed and predicted response patterns. For ordinary contingency table models the G2 is distributed as a chi-squared; unfortunately, for large contingency table models common in LTA, the chi-square becomes an inaccurate approximation of the G2 distribution. A very rough rule of thumb is that a good model has a goodness-of-fit statistic (G2 value) lower than the degrees of freedom. Relative model fit (that is, deciding which of several models is optimal in terms of balancing fit and parsimony) can be assessed with the AIC and BIC. Models with lower AIC and BIC are optimal.
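For readers who want to see how the likelihood-ratio statistic compares observed and predicted response patterns, here is a minimal Python sketch using the standard formula G2 = 2 * sum(observed * ln(observed / expected)), taken over response patterns. The frequencies are invented, and the function name is illustrative; response patterns with an observed frequency of zero contribute nothing to the sum.

```python
import math

def g_squared(observed, expected):
    """Likelihood-ratio statistic comparing observed and model-predicted frequencies."""
    return 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed, expected)
        if o > 0  # patterns never observed contribute zero
    )

# Hypothetical observed and model-predicted frequencies for four response patterns.
observed = [40, 30, 20, 10]
expected = [38.0, 32.0, 18.0, 12.0]

g2 = g_squared(observed, expected)
print(f"G2 = {g2:.3f}")
```

When the predicted frequencies equal the observed ones, the statistic is zero; it grows as the model's predictions move away from the data.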
AIC and BIC are both penalized-likelihood criteria. AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. Both criteria are based on various assumptions and asymptotic approximations. Each, despite its heuristic usefulness, has therefore been criticized as having questionable validity for real-world data. But despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily. The only way they should disagree is when AIC chooses a larger model than BIC. In general, it might be best to use AIC and BIC together in model selection. For example, in selecting the number of latent classes in a model, if BIC points to a three-class model and AIC points to a five-class model, it makes sense to select from models with 3, 4 and 5 latent classes. AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.
Interpreting Your Model
You are probably familiar with another latent variable model, factor analysis. In factor analysis, a manifest variable's loading on a factor represents the relation between the variable and the factor. Because factor loadings are regression coefficients, a factor loading of zero represents no relation between the manifest variable and the factor, whereas larger factor loadings reflect a stronger relation. In other words, all else being equal, when factor loadings are large the latent variable is being measured better. In latent class and latent transition models, rho parameters play the same conceptual role as factor loadings; however, they are NOT regression coefficients, so they are scaled differently and their interpretation is somewhat different. Rho parameters are probabilities. The closer the rho parameters are to zero and one, the more the responses to the manifest variable are determined by the latent class or latent status. The closer the rho parameters are to 1/(number of response alternatives) - this is .5 for binary variables - the weaker the relation between the manifest variable and the latent class/status. In other words, all else being equal, when rho parameters are close to zero and one, the latent variable is being measured better. Another consideration is the overall pattern of rho parameters. Ideally, the pattern of rho parameters clearly identifies latent classes/statuses with distinguishable interpretations. This is similar to the concept of simple structure in factor analysis.
Cross-validation can be an alternative to traditional goodness-of-fit testing when there are several plausible models to be compared, and there are problems associated with the distribution of the G2 fit statistic, such as when the sample size is small. Cross-validation involves splitting a sample into two (or more) subsamples, for example, Sample A and Sample B, and fitting a series of plausible models to each sample. Each model is fitted to Sample A (the calibration sample), the predicted response frequencies for each model are compared to the observed response frequencies in Sample B (the cross-validation sample), and G2 is computed. Then the reverse is done; each model is fitted to Sample B (now the calibration sample), the predicted response frequencies for this model are compared to the observed response frequencies in Sample A (now the cross-validation sample), and another G2 is computed. A model cross-validates well if the G2 is relatively small when the estimated model is applied to a cross-validation sample. When a series of models is tested, the model or models that cross-validate best are considered best-fitting.
One straightforward approach is to assign individuals to the latent class in which they have the highest posterior probability of membership (these probabilities can be saved in a SAS data file using the OUTPOST option). Then, relationships among all indicators of your latent class variable can be explored separately for each group of individuals (i.e., for each class).
More sophisticated (and statistically sound) ways to explore local dependence have been developed. One of these procedures is outlined by Bandeen-Roche, Miglioretti, Zeger, and Rathouz (1997), who multiply impute latent class membership and look for violations of the local independence assumption within each imputation. This procedure requires Bayesian estimation, which will be included in a future release of PROC LCA and PROC LTA. For now, the WinLTA software (available at http://methodology.psu.edu/downloads/winlta) can be used to impute the latent class variable for models with no covariates.
Can I assign individuals to latent classes or latent statuses based on their posterior probabilities?
We do not recommend assigning individuals to latent classes or latent statuses based on their posterior probabilities unless there is no viable alternative. By assigning individuals to latent classes or latent statuses, you introduce error into your results. There are many different types of analyses that can be performed within the latent class modeling framework (e.g., predicting latent class membership) without having to assign individuals to latent classes or latent statuses. When possible, we recommend working within the latent class modeling framework because it incorporates measurement error into the model, which is ignored by class/status assignment. If you are planning to assign individuals based on posterior probabilities, one article of interest might be:
Goodman, L. A. (2007). On the assignment of individuals to latent classes. Sociological Methodology, 37(1), 1-22. doi: 10.1111/j.1467-9531.2007.00184.x
In this paper, Goodman describes two ways to assign individuals and two criteria that can be used to assess when class assignment is satisfactory and when it is not. If you assign individuals to classes/statuses, we recommend evaluating the amount of measurement error introduced by doing so.