# [Minutes] Fall 2016 Meeting # 3 - 09/20/16 - Factor Analysis Data Presentation

Today we wrapped up our discussion of factor analysis with an applied, real-life example, courtesy of Catie Walsh. Below is a brief overview of her project and the major points that came up during her presentation.

Catie was interested in modeling psychological distress from a set of 14 items drawn from three separate scales. Although we only touched briefly on this aspect of the project, these were repeated measures, so she is dealing with the extra complication of identifying factors that replicate in the same sample across time (e.g., if there are two factors of depression and hostility, do these same factors come out at each time point?). Today, we mostly discussed how to determine the appropriate number of factors and didn't get to say much about measurement invariance - but that could be a topic for later meetings if the interest is there!

There are a number of quantitative approaches to determining how many factors to use when you are running an EFA, which Aidan nicely put as "tools, not rules" for determining the appropriate factor structure. Before we talked about the specific metrics, though, we discussed why the choice of how many factors to use matters. Generally, it is better to over-factor, or identify too many factors, than to under-factor, or identify too few factors. With over-factoring, you may pull out factors that are really just a single item or that may not replicate, but what often happens is that you divide what is truly a single latent construct into multiple factors. The more problematic situation is under-factoring, where you fail to distinguish between distinct constructs because they are not separated into different factors. This could happen, going back to Catie's example, if you had latent constructs of depression, anxiety, and hostility - if you go with a two-factor model, depression and anxiety might load on the same factor (and I'm not a clinical researcher so maybe this is actually what you would want...?), making it difficult to isolate how these latent constructs uniquely operate.

Getting to the actual metrics, we talked about four different approaches today, the first three of which deal with eigenvalues, which help us understand how much of the common variance across a set of items is explained by a factor. One guideline for determining the number of factors is to retain factors that have eigenvalues above 1, as this means that the factor explains more of the common variance than a single item would and thus is considered (by this standard) worth retaining.
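One way to see where the "above 1" cutoff comes from, as a sketch that assumes the eigenvalues are taken from the full correlation matrix of the $p$ items (the symbols $R$, $p$, and $\lambda_j$ are notation introduced here, not taken from Catie's slides):

$$
\sum_{j=1}^{p} \lambda_j = \operatorname{tr}(R) = p
\quad\Longrightarrow\quad
\bar{\lambda} = \frac{1}{p}\sum_{j=1}^{p} \lambda_j = 1
$$

Each standardized item contributes a variance of 1, so a factor with $\lambda_j > 1$ accounts for more variance than one item's worth, and $\lambda_j / p$ is its share of the total.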

The second approach is related to this - rather than looking at the value of individual eigenvalues, you plot them in what is called a scree plot (with factor # on the X axis and eigenvalue on the Y) to determine where the line through these points "curves." Eigenvalues will drop as you add more factors, given that each factor can explain less and less of the remaining shared variance, but you want to see where this drop-off levels out. This is pretty subjective but is meant to identify at what point (what number of factors) you see factors explaining substantially less of the shared variance.
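For a rough feel of what reading a scree plot involves, here is a minimal Python sketch. The eigenvalues are hypothetical (they are not Catie's), and the helper names `successive_drops` and `text_scree` are made up for illustration; looking at the size of each drop is only a crude numeric stand-in for eyeballing the curve.

```python
def successive_drops(eigenvalues):
    """Difference between each eigenvalue and the next one."""
    return [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]


def text_scree(eigenvalues, width=40):
    """Render a sideways text scree plot, one bar per factor."""
    top = max(eigenvalues)
    lines = []
    for k, ev in enumerate(eigenvalues, start=1):
        bar = "#" * max(1, round(width * ev / top))
        lines.append(f"factor {k:>2} | {bar} {ev:.2f}")
    return "\n".join(lines)


# HYPOTHETICAL eigenvalues, for illustration only
eigenvalues = [5.2, 1.4, 0.9, 0.8, 0.7]

print(text_scree(eigenvalues))
for k, drop in enumerate(successive_drops(eigenvalues), start=1):
    print(f"drop from factor {k} to {k + 1}: {drop:.2f}")
```

Where the drops become small and roughly equal is where the curve has flattened; deciding exactly where that happens is still a judgment call.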

The third approach examining eigenvalues is called parallel analysis, which basically asks whether each factor explains more of the common variance among these (presumably) interrelated variables than the corresponding second, third, etc. factor would in random, simulated data. If an additional factor would explain the same amount of additional variance among variables that were not actually related, the argument is that this factor is not picking up on theoretically important shared variance in your data either. In a way, it's just providing a more realistic set of cutoffs for each factor than the "above 1" rule mentioned above. Catie had a nice example in her slides of what this looks like, where you get eigenvalues for factors 1, 2, and 3 in the real data and comparison eigenvalues for randomly generated data for those same factors. One point to note here, though, is that these analyses tell you up to how many factors to consider, but you can always trim more based on interpretation and theory. These guidelines also don't need to be applied so strictly. In Catie's example, factor 2 had an eigenvalue of 1.428 in the real data and 1.430 in the simulated data. Even though the simulated eigenvalue was higher (so factor 2 explained slightly more common variance in the simulated data than it did in the observed data), you might still want to consider a two-factor model because the values are so close.
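The comparison step can be sketched in a few lines of Python. Only the factor 2 eigenvalues (1.428 observed, 1.430 simulated) come from Catie's example; the factor 1 and factor 3 values are hypothetical placeholders, and the `tolerance` argument is an assumption added here to show how a near-tie might be treated less strictly.

```python
def factors_supported(observed, simulated, tolerance=0.0):
    """Count leading factors whose observed eigenvalue beats the simulated one.

    Stops at the first factor that fails, since later factors only
    matter once the earlier ones have been retained.
    """
    count = 0
    for obs, sim in zip(observed, simulated):
        if obs > sim - tolerance:
            count += 1
        else:
            break
    return count


# Factor 2 values are from Catie's example;
# factor 1 and factor 3 values are HYPOTHETICAL placeholders.
observed = [6.10, 1.428, 0.95]
simulated = [1.60, 1.430, 1.30]

print(factors_supported(observed, simulated))                  # strict rule: 1
print(factors_supported(observed, simulated, tolerance=0.01))  # near-ties allowed: 2
```

The strict rule stops at one factor, while a small tolerance keeps the second, which mirrors the "it's so close" reasoning above. How much tolerance is reasonable is a judgment call, not a fixed standard.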

The last approach Catie showed, which we discussed briefly, was to look at model fit (in an SEM framework). Often with a lot of items you will get poor fit statistics because you are losing so much of the variance in the 14 items by trying to combine them into two or three scales. You can still look at model comparisons, though, which tell you whether adding an additional factor significantly improves the fit of the model. Just as the eigenvalues were always positive and non-zero, model fit will always improve when you add factors - the point is to identify whether the improvement is enough. Is it more than would be expected based on a single item (above 1)? Is it as much as was seen with previous factors (scree plot)? Is it more than you would expect by chance (parallel analysis)? Is it enough to significantly improve model fit (model comparisons)? Sometimes (like in Catie's example), the indicators all point to different suggestions, and so it's up to the researcher to determine how many factors to include (lucky you!) based on interpretation and theory.
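One common way to run such a model comparison is a chi-square difference test between nested models. The Python sketch below assumes maximum likelihood estimation and uses hypothetical chi-square values (they are not from Catie's output); the function names are made up for illustration, and 14 is the item count given in these minutes.

```python
import math


def efa_df(n_items, n_factors):
    """Degrees of freedom of an ML exploratory factor model."""
    p, m = n_items, n_factors
    return ((p - m) ** 2 - (p + m)) // 2


def chi2_sf(x, df):
    """Upper-tail chi-square probability via the incomplete-gamma series."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    log_p = a * math.log(half_x) - half_x - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_p))


def chi2_difference(chi2_fewer, chi2_more, n_items, n_fewer):
    """Compare a model with n_fewer factors against one with n_fewer + 1."""
    delta_chi2 = chi2_fewer - chi2_more
    delta_df = efa_df(n_items, n_fewer) - efa_df(n_items, n_fewer + 1)
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# HYPOTHETICAL chi-square values for the 1- and 2-factor models
d_chi2, d_df, p_value = chi2_difference(131.5, 101.2, n_items=14, n_fewer=1)
print(f"delta chi2 = {d_chi2:.1f}, delta df = {d_df}, p = {p_value:.4f}")
```

A small p-value says the extra factor improves fit more than chance alone would suggest. With other estimators, a plain difference like this may not be appropriate.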

So getting to Catie's actual output, we talked about how to interpret each of the patterns of factor loadings from her EFAs. In the one-factor model, everything loads highly (above .3, which means the factor explains around 10% of the variability in observed responses), which is a good sign. With the two-factor model, the factors that emerge look like method factors - one has items that are descriptive adjectives, and the other has items that are statements. There might also be differences in phrasing that could influence this. With three factors, the third factor really only has one item loading on it, which is a sign that the data have been over-factored. We talked about how it seems to come down to deciding between a one- or two-factor model. If the method distinctions that are getting pulled apart in the two-factor model are relevant and important to you, you would prefer the two-factor model. Otherwise, you would probably prefer the one-factor model.
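The two rules of thumb in this paragraph (the .3 cutoff and the single-item factor) can be sketched in Python. The loadings below are hypothetical, attached to a few item names from the posted syntax purely for illustration, and squaring a loading to get variance explained assumes standardized loadings on a single factor or on uncorrelated factors.

```python
def variance_explained(loading):
    """Share of an item's variance explained, for a standardized loading."""
    return loading ** 2


def salient_items(loadings, cutoff=0.3):
    """Map each factor index to the items whose absolute loading meets the cutoff."""
    n_factors = len(next(iter(loadings.values())))
    return {
        f: [item for item, row in loadings.items() if abs(row[f]) >= cutoff]
        for f in range(n_factors)
    }


# HYPOTHETICAL three-factor loadings, for illustration only
loadings = {
    "nervous": [0.62, 0.10, 0.05],
    "blue": [0.71, 0.08, 0.02],
    "sad": [0.15, 0.66, 0.04],
    "angry": [0.09, 0.58, 0.12],
    "pile": [0.11, 0.07, 0.55],
}

for factor, items in salient_items(loadings).items():
    flag = "  <- only one item; possible over-factoring" if len(items) < 2 else ""
    print(f"factor {factor + 1}: {items}{flag}")

print(round(variance_explained(0.3), 2))  # 0.09, i.e. roughly 10%
```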

Catie also showed us some CFA results. A few things to note: the EFA and CFA have identical fit and factor loadings in a single-factor model, as they should. Once you add in a second factor, the CFA sets some of the loadings to zero for specific factors, which you can see in the table.

Catie's other issue was with figuring out how to model these factors over time, if certain items operate differently and load on the single factor of distress differently with repeated measurements. One suggestion that was discussed, particularly in this scenario with such a large number of items, is to select the items that most clearly and consistently map onto the latent factor of interest. This should be done thoughtfully and intentionally and can take a variety of forms, such as (with Catie's example) choosing the items that load most strongly, picking items that map onto one piece of the construct very clearly (e.g., including just the depression items), or selecting items that span the breadth of the construct (e.g., making sure you have some items addressing depression, anxiety, and hostility).

Thanks again to Catie for presenting (please let me know if I missed anything in describing your project or presentation)! Feel free to comment with questions, corrections, other thoughts, or things I may have missed!

**LeanneElliott**- Moderator

## Re: [Minutes] Fall 2016 Meeting # 3 - 09/20/16 - Factor Analysis Data Presentation

Thanks Leanne! This is a great summary. My presentation is attached, and my Mplus Syntax is below:

- Code:

```
! EFA: extract 1-3 factors
Data:
  File is C:\12 TP Distress Composite items for factor analysis.csv;
Variable:
  Names are
    id nervous dumps blue sad hostile edge angry tense unhappy
    control handle going pile;
  IDVARIABLE = id;
  Missing are all (-999);
  USEVARIABLES ARE nervous dumps blue sad hostile edge angry
    tense unhappy control handle going pile;
Analysis:
  !type = basic; for data descriptives
  type = efa 1 3;
  estimator = ml; !default
  rotation = geomin; !default geomin rotation
  !rotation = promax;
  parallel (50); !draws 50 random samples to be compared to sample data
Output: sampstat;
Plot: type = plot2; !for scree plot
```

```
! CFA: 1 factor
Data:
  File is C:\12 TP Distress Composite items for factor analysis.csv;
Variable:
  Names are
    id nervous dumps blue sad hostile edge angry tense unhappy
    control handle going pile;
  IDVARIABLE = id;
  Missing are all (-999);
  USEVARIABLES ARE nervous dumps blue sad hostile edge angry
    tense unhappy control handle going pile;
Analysis:
Model:
  !define distress factors
  D1 BY nervous* dumps blue sad hostile edge angry
    tense unhappy control handle going pile;
  !set factor variance
  D1@1;
Output: sampstat STDYX RESIDUAL MODINDICES TECH1;
Plot: type = plot1 plot2 plot3;
```

```
! CFA: 2 factors
Data:
  File is C:\12 TP Distress Composite items for factor analysis-outliers.csv;
Variable:
  Names are
    id nervous dumps blue sad hostile edge angry tense unhappy
    control handle going pile;
  IDVARIABLE = id;
  Missing are all (-999);
  USEVARIABLES ARE nervous dumps blue sad hostile edge angry
    tense unhappy control handle going pile;
Analysis:
Model:
  !define distress factors
  D1 BY nervous@0 dumps@0 blue@0 sad hostile edge angry
    tense unhappy control@0 handle@0 going@0 pile@0;
  D2 BY nervous* dumps blue sad@0 hostile@0 edge@0 angry@0
    tense@0 unhappy@0 control handle going pile;
  !set factor variance
  D1@1;
  D2@1;
Output: sampstat STDYX RESIDUAL MODINDICES TECH1;
Plot: type = plot1 plot2 plot3;
```

Last edited by JeffreyGirard on Wed Sep 21, 2016 7:33 pm; edited 1 time in total (Reason for editing : Added the [code] tags)

**CatieWalsh**- Admin

## Re: [Minutes] Fall 2016 Meeting # 3 - 09/20/16 - Factor Analysis Data Presentation

Catie, I accidentally down-voted your post and I can't undo it, but your slides and syntax are great! Sorry for my lack of technical savvy-ness

**LeanneElliott**- Moderator