Number of observations displayed

Welcome to the forum for R2MLwiN users. Feel free to post your question about R2MLwiN here. The Centre for Multilevel Modelling take no responsibility for the accuracy of these posts, we are unable to monitor them closely. Do go ahead and post your question and thank you in advance if you find the time to post any answers!

Go to R2MLwiN: Running MLwiN from within R >> http://www.bris.ac.uk/cmm/software/r2mlwin/
Post Reply
adeldaoud
Posts: 63
Joined: Sat Aug 15, 2015 4:00 pm

Number of observations displayed

Post by adeldaoud » Tue Apr 17, 2018 11:30 pm

Hi

The mymodel@Hierarchy slot does not get the correct number of intermediate levels (e.g. class id in a data of students nested in class nested in schools) when these intermediate unit id are renumbered.

For example:

School class student
1 1 1
1 1 2
1 2 1
1 2 2
2 1 1
2 1 2
2 2 1
2 3 1


The reason I would like to have these intermediate level id restarting is that R2mlwin is faster in processing the data. In debuggmode, the Hierarchy viewer displays the correct nr of units.

Any ideas of how to solve this?

ChrisCharlton
Posts: 1018
Joined: Mon Oct 19, 2009 10:34 am

Re: Number of observations displayed

Post by ChrisCharlton » Wed Apr 18, 2018 9:11 am

I would have to check, but it's possible that the R display is using unique codes instead of changes to the ID value as in MLwiN to determine different units. If you renumber the classes so that each has a unique code across schools is the hierarchy reported correctly?

adeldaoud
Posts: 63
Joined: Sat Aug 15, 2015 4:00 pm

Re: Number of observations displayed

Post by adeldaoud » Wed Apr 18, 2018 12:08 pm

Hi Chris,
I would have to check, but it's possible that the R display is using unique codes instead of changes to the ID value as in MLwiN to determine different units.
Ok, thanks, that would be great. Again, I could, of course, use unique id, but performance is noticeably slower with large datasets.

If you renumber the classes so that each has a unique code across schools is the hierarchy reported correctly?
Yes, it is reposted correctly when I use unique id.

ChrisCharlton
Posts: 1018
Joined: Mon Oct 19, 2009 10:34 am

Re: Number of observations displayed

Post by ChrisCharlton » Wed Apr 18, 2018 12:28 pm

Could you show an example where the ID code makes a difference to the estimation time? If the model is the same then I can't see why it would make a difference to MLwiN. It might potentially make a difference in R while calculating the hierarchy, but then this would be due to the calculation being wrong in the case of non-unique codes. R2MLwiN can take advantage of the reshape (see https://cran.r-project.org/web/packages ... index.html) package to speed up the hierarchy calculation.

adeldaoud
Posts: 63
Joined: Sat Aug 15, 2015 4:00 pm

Re: Number of observations displayed

Post by adeldaoud » Wed Apr 18, 2018 11:06 pm

Chris, I ran a test testing the difference between the two id styles. I used the data given in LEMMA section 11.2: A Three-Level Variance Components Model using the the STATA data 11.2.dta.

Code: Select all

library(foreign)
tut <- read.dta("11.2.dta")


# run with unique Id
unique.id.performance <- rep(NA, 10)
for(i in 1:10) {
  start <- Sys.time()
  mymodel2 <- runMLwiN(postthks ~ 1 +  (1 | schoolid) + (1 | classid)+ (1 | studentid), data = tut, 
                       estoptions = list(EstM = 0, optimat=T,  resi.store = F, debugmode=F), MLwiNPath = mlwinpath)
  end <- Sys.time()
  start - end
  unique.id.performance[i] <- end - start
  #return(within.id.performance)
}

mean(unique.id.performance)

# create a repeated id
library(dplyr)

# create a a within-class id
tut1 <- tut %>%
  group_by(schoolid, classid) %>%
  mutate(studentid_Rep = row_number())

tut2 <- tut %>%
  select(schoolid, classid) %>%
  distinct() %>%
  group_by(schoolid) %>%
  mutate(classid_Rep = row_number())

# merge back
tut3 <- tut1 %>%
  left_join(tut2, by = c("schoolid" = "schoolid", "classid"= "classid"))

# create a comparable dimensioned data frame
tut4 <- tut3 %>%
  ungroup() %>%
  select(-classid, -studentid)

within.id.performance <- rep(NA, 10)
for(i in 1:10) {
  
  start1 <- Sys.time()
  mymodel3 <- runMLwiN(postthks ~ 1 +  (1 | schoolid) + (1 | classid_Rep)+ (1 | studentid_Rep), data = tut4, 
                       estoptions = list(EstM = 0, optimat=T,  resi.store = F, debugmode=F), MLwiNPath = mlwinpath)
  end1 <- Sys.time()
  
  within.id.performance[i] <- end1 - start1
  #return(within.id.performance)
}

# diff
mean(within.id.performance) - mean(unique.id.performance)
I ran the experiment 10 times each to account for PC performance discrepancies with background apps. With this data (1600 rows and 11 covariates) using unique id is about 0.03 seconds slower. With my data which often has 1 million rows, performance slowdown is noticeable.

Curious to see if you get similar differences on your machine.

ChrisCharlton
Posts: 1018
Joined: Mon Oct 19, 2009 10:34 am

Re: Number of observations displayed

Post by ChrisCharlton » Fri Apr 20, 2018 2:02 pm

Thanks for this, yes I do see a difference in run time between the two:

Code: Select all

> within.id.performance
 [1] 0.3839951 0.3889999 0.3940001 0.3780010 0.4100001 0.3949988 0.4060080 0.3809931 0.3759990 0.3880091
> unique.id.performance
 [1] 0.7130022 3.5269980 0.4020000 0.4060030 0.4049971 0.4170029 0.4039981 0.4229989 0.4150012 0.4019990
However, the estimation time is the same:

Code: Select all

> mymodel2@elapsed.time
elapsed 
   0.21 
> mymodel3@elapsed.time
elapsed 
   0.21 
which would mean that discrepancy is within R2MLwiN, rather than MLwiN itself. This would make sense because as far as R is concerned the version with unique IDs has more values to consider and would therefore take longer to process them all. Looking at the code for generating the hierarchy it does seem that there are bugs, which I will investigate and attempt to fix.

adeldaoud
Posts: 63
Joined: Sat Aug 15, 2015 4:00 pm

Re: Number of observations displayed

Post by adeldaoud » Sat Apr 21, 2018 9:11 pm

Thanks for your reply Chris. I was concerned that the difference between the models given the small data would hide in the third or fewer decimal points. So I expanded the same data, and ran the models. As you see below, I do still get a difference within Mlwin. Not sure if this is unique to my PC. Could you please confirm that you get the same or similar difference?

Code: Select all

# expand the data
tut.rep <- bind_rows(replicate(500, tut, simplify = FALSE))
tut.rep <- tut.rep %>%
  arrange(schoolid, classid, studentid)


mymodel2 <- runMLwiN(postthks ~ 1 +  (1 | schoolid) + (1 | classid)+ (1 | studentid), data = tut.rep, 
                   estoptions = list(EstM = 0, optimat=T,  resi.store = F, debugmode=F), MLwiNPath = mlwinpath)


tut4.rep <- bind_rows(replicate(500, tut4, simplify = FALSE))
tut4.rep <- tut4.rep %>%
  arrange(schoolid, classid_Rep, studentid_Rep)


mymodel3 <- runMLwiN(postthks ~ 1 +  (1 | schoolid) + (1 | classid_Rep)+ (1 | studentid_Rep), data = tut4.rep, 
                     estoptions = list(EstM = 0, optimat=T,  resi.store = F, debugmode=F), MLwiNPath = mlwinpath)



Code: Select all

# Estimation time 

## unique id estimation
mymodel2@elapsed.time
elapsed 
   4.94

## within-id estimation

Code: Select all

> > mymodel3@elapsed.time
elapsed 
   4.52  

ChrisCharlton
Posts: 1018
Joined: Mon Oct 19, 2009 10:34 am

Re: Number of observations displayed

Post by ChrisCharlton » Sun Apr 22, 2018 7:17 am

I have just committed a fix for the correct display in R to the development version of R2MLwiN. It looks like it was not using the hierarchy information when calculating the group statistics, and instead assuming that the IDs were unique across the higher levels. If you get a chance to test the change please let me know whether it solves the problem that you were seeing. I will give your new code example a try to see whether I also get a noticeable difference in estimation time.

ChrisCharlton
Posts: 1018
Joined: Mon Oct 19, 2009 10:34 am

Re: Number of observations displayed

Post by ChrisCharlton » Mon Apr 23, 2018 9:48 am

I ran you example a couple of times and one version didn't seem obviously slower than the other:

Code: Select all

# First run
> mymodel2@elapsed.time
elapsed 
   3.53 
> mymodel3@elapsed.time
elapsed 
   3.55 

Code: Select all

# Second run
> mymodel2@elapsed.time
elapsed 
   3.54 
> mymodel3@elapsed.time
elapsed 
   3.53 
The difference you see may be due to something else running on your system, or some kind of caching occurring (if you run the models in the opposite order do the relative times change?).

Post Reply