Nested covariates and implications for multilevel model structure
Posted: Mon Apr 17, 2023 12:03 pm
I have been asked to run a model of student exam data, with multiple entries per student. The repeated measures will form the lower level, and the students the higher level, of a 2-level hierarchy.
I have various student-level variables that I need to include: gender, ethnicity etc. I also need to include department and faculty information for each student (at this institution, there are 6 faculties, and each is split into between 2 and 4 departments).
My understanding is that it is not valid to specify department and faculty as levels 3 and 4 of my hierarchy, as both are fixed classifications. (We are not currently proposing to generalise findings beyond this institution.) I can add them as additional student-level terms in the same way that I add gender, ethnicity etc. (i.e. with one reference category specified and using indicator variables)... but the department/faculty data is nested. Faculty A contains only departments 1, 2 and 3; faculty B contains only departments 4 and 5, and so on.
I wondered what (if any) the implications of this nesting are - can the indicator variables modelling department and faculty be added to the model in the same way that other student-level variables such as gender are added in MLwiN? Or is there a better way?
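To make the nesting concern concrete, here is a minimal sketch in Python with NumPy (not MLwiN). The two-faculty, five-department layout and all variable names are made up for illustration. It suggests that, because each department sits in exactly one faculty, a faculty indicator is just the sum of that faculty's department indicators, so adding both full sets of dummies adds a column without adding any rank.

```python
import numpy as np

# Hypothetical layout: faculty A holds departments 1-3, faculty B holds 4-5.
dept_to_faculty = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
departments = sorted(dept_to_faculty)
faculties = sorted(set(dept_to_faculty.values()))

# One illustrative student per department is enough to show the column structure.
student_dept = np.array(departments)
student_fac = np.array([dept_to_faculty[d] for d in student_dept])

def indicators(values, categories):
    """Indicator columns for every category except the first (the reference)."""
    return np.column_stack([(values == c).astype(float) for c in categories[1:]])

intercept = np.ones((len(student_dept), 1))
dept_dummies = indicators(student_dept, departments)  # departments 2, 3, 4, 5
fac_dummies = indicators(student_fac, faculties)      # faculty B

X_dept_only = np.hstack([intercept, dept_dummies])
X_both = np.hstack([intercept, dept_dummies, fac_dummies])

# Compare the number of columns with the rank of each design matrix.
rank_dept_only = np.linalg.matrix_rank(X_dept_only)
rank_both = np.linalg.matrix_rank(X_both)
print(X_dept_only.shape[1], rank_dept_only)
print(X_both.shape[1], rank_both)
```

If this reasoning is right, the department indicators on their own already carry all the faculty information, and a faculty effect could only be recovered afterwards as a contrast between groups of department coefficients.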
It also occurs to me that if we decide to generalise our findings beyond our own institution, then both department and faculty become random classifications - and in this situation, might be considered to be levels 3 and 4 of the model, dispensing with the need to model them as fixed effects using indicator variables?
A possible further complication: currently I have data supplied to me from only 1 of the 6 faculties. It's a safe bet that I will get data from 3 or 4 in total, but there is a good chance that 1 or 2 faculties will not supply data at all. In which case, I am back to thinking that both faculty and department are random classifications... except that the population of units (6) will not be very much larger than the number of units sampled (say 4).
So in summary, the questions I am asking myself are:
1. Does the nesting of certain student-level factors treated as fixed classifications (department within faculty) matter?
2. If I decide to generalise my own institution's data beyond its boundaries, can I consider department and faculty to be levels, rather than fixed classification variables, in the model?
3. If I do not receive data from all the faculties in my institution, even if I do not generalise beyond the institution, again, can I consider department and faculty to be levels, rather than fixed classification variables, in the model?
Any insights on any of this would be much appreciated, and apologies if I have missed something obvious.
Many thanks
John