Advice on cross-classified models with large data sets
- 
				rdmcdowell
- Posts: 31
- Joined: Mon Apr 02, 2012 3:26 pm
Advice on cross-classified models with large data sets
Hello. I was wondering whether anyone could share their experience of fitting cross-classified models to large data sets. For example, I'm examining health outcomes in individuals (level 2, n=500,000) over time (level 1), but patients may be seen by different doctors on each occasion, so the data are cross-classified. There are over 2,500 doctors, and although the vast majority of patients see only one doctor, around 15% see two or more different doctors.
When I try to identify unique "patient/doctor" clusters I come up with around 3,000, which is too many for Stata to cope with (the number of constraints has to be less than 2,000). Even when I take a subset of the data that needs fewer constraints and try fitting an appropriate model in runmlwin, I get the error "too many macros". Could MLwiN cope with fitting this directly?
My initial thought was to fit the model using IGLS estimation and then rerun it using MCMC to take account of the cross-classification. Unsurprisingly, however, even on a very small subsample of the data there are very high levels of autocorrelation, and even with the techniques described at the end of the MLwiN manuals that can help with mixing, it could take many days to converge on my trusty desktop PC, if it gets there at all! I think I should really start by reducing the sample size and focusing on the subsample of patients that is most important, but even so there are going to be thousands of doctors involved and a certain degree of cross-classification. I wondered whether anyone had any advice to share from their own experience?
If I try the multiple membership approach (wide-format data, with only one doctor per time point and all weights = 1), would that fit a cross-classified model, or would it give the estimate obtained from the usual IGLS estimation, where the cross-classification is not taken into account? Thanks in advance!
Re: Advice on cross-classified models with large data sets
Hi RDMCDOWELL,
If you fit a cross-classified model then it's best to use MCMC in MLwiN and so you would have higher levels/classifications for doctors and patients. I wouldn't suggest you use the clustering as described. I think the IGLS method would really struggle. The multiple membership approach only makes sense if the patient is seen by more than one doctor on the same occasion.
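As a rough sketch of what this might look like in runmlwin: the variable names (outcome, time, doctor, patient, occasion) are hypothetical, and a continuous response is assumed purely for simplicity. The first fit treats the classifications as if they were nested, only to supply starting values, and the second refits by MCMC with the cc option so that doctors and patients are treated as cross-classified. The data may need sorting by the level identifiers before the first call.

```stata
* Hypothetical variable names: outcome, cons, time, doctor, patient, occasion
* Step 1: IGLS fit treating the classifications as nested (starting values only)
runmlwin outcome cons time, ///
	level3(doctor: cons) ///
	level2(patient: cons) ///
	level1(occasion: cons) nopause

* Step 2: MCMC fit with doctor and patient treated as cross-classified
runmlwin outcome cons time, ///
	level3(doctor: cons) ///
	level2(patient: cons) ///
	level1(occasion: cons) ///
	mcmc(cc) initsprevious nopause
```

Trying this on a small subsample first would give an idea of run time and mixing before committing to the full data set.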
If MCMC is producing correlated chains, you might want to look at some of the additional MCMC methodology options in the later chapters of my MCMC manual.
Best wishes,
Bill.
- 
				rdmcdowell
- Posts: 31
- Joined: Mon Apr 02, 2012 3:26 pm
Re: Advice on cross-classified models with large data sets
Many thanks, Bill, for that advice. I will just have to be patient with the MCMC runs, then!
- 
				rdmcdowell
- Posts: 31
- Joined: Mon Apr 02, 2012 3:26 pm
Re: Advice on cross-classified models with large data sets
I have found that MCMC estimation works well for these models, provided I am careful about using informative priors. Thankfully I have similar analyses from another dataset that I can use to define these, and I am following the examples in Chapter 5 of the manual to set them up in runmlwin.
One problem I have is in defining the matrix P for the informative priors. When I set up the prior matrix using the command matrix P=(.*b \ .*b), the last column of P is labelled OD:bcons_1 rather than RP1 (the outcome is non-normal), and Stata objects when executing runmlwin with "equation RP1 not found". Is there any way I can change the "equation" of the last column from OD to RP1? Thank you.
Re: Advice on cross-classified models with large data sets
Hi rdmcdowell,
I'm afraid I didn't write runmlwin, so I can't help here, but I'll mention this to George Leckie and Chris Charlton so that they can reply to you.
Best wishes,
Bill.
- 
				ChrisCharlton
- Posts: 1390
- Joined: Mon Oct 19, 2009 10:34 am
Re: Advice on cross-classified models with large data sets
The short answer is that you can change these equation names with the matrix coleq command (see https://www.stata.com/help.cgi?matrix+rownames). I am not, however, sure why this would be necessary, so I will investigate further.
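For illustration only, here is a minimal sketch of that renaming on a toy matrix. The values and the column names (cons, age, bcons_1) are made up; the OD and RP1 labels come from the question above, and FP1 is assumed here as the label for the fixed-part columns.

```stata
* Toy 2 x 3 prior matrix; values and names are illustrative only
matrix P = (0, 0, 0 \ 1, 1, 1)
matrix rownames P = mean sd
matrix colnames P = cons age bcons_1
matrix coleq P = FP1 FP1 OD
matrix list P

* Relabel the column equations; names are assigned in column order
matrix coleq P = FP1 FP1 RP1
matrix list P

* Confirm the stored equation names
local eqs : coleq P
display "`eqs'"
```

Whether runmlwin then accepts the relabelled column depends on how it matches the prior matrix against the model parameters, so this only shows the Stata side of the change.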
Code:
matrix coleq
- 
				ChrisCharlton
- Posts: 1390
- Joined: Mon Oct 19, 2009 10:34 am
Re: Advice on cross-classified models with large data sets
I have now looked into this further, and it appears that the extra "bcons" parameters for non-normal models, as well as empty levels, were not being handled correctly within the informative priors settings. I have attached a new version of the command in which this should be fixed. We will do some more testing and provide a new release once we have confirmed that it works correctly.
- Attachments
- runmlwin.ado (264.16 KiB)
- 
				rdmcdowell
- Posts: 31
- Joined: Mon Apr 02, 2012 3:26 pm
Re: Advice on cross-classified models with large data sets
Thanks for that very prompt response! I downloaded that version of runmlwin and installed it, but the same issue came up. As it stands, the last column in my matrix of priors is called OD:bcons_1. I've tried renaming it to RP1:var(bcons_1) and RP1:var(cons), but I still get the error "equation RP1 not found". What should I rename it to?
- 
				ChrisCharlton
- Posts: 1390
- Joined: Mon Oct 19, 2009 10:34 am
Re: Advice on cross-classified models with large data sets
Did you run a discard command after updating your version of runmlwin.ado? The following example works for me; could you confirm whether it does for you too?
Code:
* Load the runmlwin example data set
use http://www.bristol.ac.uk/cmm/media/runmlwin/bang1.dta, clear
* Fit by IGLS first to obtain estimates and a template for the prior matrix
runmlwin use cons age, ///
	level2(district: ) ///
	level1(woman:) ///
	discrete(distribution(binomial) link(logit) denom(cons)) nopause
estimates store IGLS
matrix b = e(b)
matrix V = e(V)
* Build a two-row matrix (prior mean, prior sd) with the columns of e(b), all entries missing
matrix P = (.*b \ .*b)
matrix rownames P = mean sd
matrix list P
* Set an informative prior on the second parameter: mean 1, sd 0.01
matrix P[1,2] = 1
matrix P[2,2] = .01
matrix list P
* Refit by MCMC using the prior matrix, starting from the IGLS estimates
runmlwin use cons age, ///
	level2(district: ) ///
	level1(woman:) ///
	discrete(distribution(binomial) link(logit) denom(cons)) ///
	mcmc(priormatrix(P)) initsprevious nopause
- 
				rdmcdowell
- Posts: 31
- Joined: Mon Apr 02, 2012 3:26 pm
Re: Advice on cross-classified models with large data sets
Thanks! I restarted Stata and everything seems to be working well now, which is great!
As a slight aside, I see from the MCMC manual that you can export code to WinBUGS if you wish to use priors other than the available options. Does this mean that the revised code has to be run in WinBUGS and can't be re-imported into MLwiN? For example, in my models I have found that informative priors work well for the fixed effects associated with continuous predictors, and for the random effects. However, the default improper uniform prior performs poorly for the fixed effects associated with dummy variables, and it seems I can only get estimates for these of similar magnitude to those obtained using IGLS if I am quite restrictive with the SD.