The inconsistency of studies of gender differences in cognitive abilities: due to using different methods?

I read this study:

Palejwala, M. H., & Fine, J. G. (2015). Gender differences in latent cognitive abilities in children aged 2 to 7. Intelligence, 48, 96-108.
It reminded me that these studies rarely test what happens when one applies several different methods to the same dataset. Nyborg (2003) wrote years ago that the divergent results were probably to a large degree due to this. There seems to have been no change in research practices since then.
Due to the lack of data sharing, it is generally not possible for researchers so inclined to perform such a study on the data that other authors used. One is limited to either gathering data oneself or finding an open dataset.
Wicherts and Bakker (2012) have provided researchers with an open dataset. It is far from perfect: the sample is unrepresentative (psychology students, young and mostly female) and only medium-sized (N = 400-500, depending on the treatment of missing data).
Obviously, the dataset cannot be used to determine the size of any difference in the general population. However, it should be sufficient for seeing how different methods compare. One can simply sum the standardized subtest scores. One can use EFA in various ways (different extraction methods, Schmid-Leiman transformed or not, hierarchical or not) to extract latent traits and compare their means and variances. One can do it with CFA/SEM with different models (hierarchical, bi-factor). How would all these results compare? The data are public, so who wants to do this study with me?
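To make the simplest of these comparisons concrete, here is a minimal Python sketch contrasting the gender gap on a unit-weighted sum score with the gap on a first-factor score from a one-factor EFA. The file name, column names, and sex coding are hypothetical placeholders, and it assumes pandas, NumPy, and scikit-learn.

```python
# Minimal sketch (not the actual analysis): compare the male-female gap on a
# unit-weighted sum score with the gap on a first-factor score from a one-factor EFA.
# File name, column names, and sex coding are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

df = pd.read_csv("wicherts_bakker_open_data.csv")              # hypothetical file name
subtests = [c for c in df.columns if c != "sex"]               # assume all other columns are subtests
X = (df[subtests] - df[subtests].mean()) / df[subtests].std()  # z-score each subtest

def cohens_d(scores, sex):
    """Standardized mean difference (male minus female), pooled SD."""
    m, f = scores[sex == "male"], scores[sex == "female"]
    pooled_var = ((len(m) - 1) * m.var() + (len(f) - 1) * f.var()) / (len(m) + len(f) - 2)
    return (m.mean() - f.mean()) / np.sqrt(pooled_var)

# Method 1: unit-weighted sum of the standardized subtests
sum_score = X.sum(axis=1)

# Method 2: score on the first factor from a one-factor EFA
g_score = pd.Series(FactorAnalysis(n_components=1).fit_transform(X)[:, 0], index=X.index)

print("d, sum score:   ", round(cohens_d(sum_score, df["sex"]), 3))
print("d, factor score:", round(cohens_d(g_score, df["sex"]), 3))
```

The more interesting comparisons (Schmid-Leiman, hierarchical and bi-factor CFA/SEM) would need a dedicated factor-analysis or SEM package, but the same gap statistic can be computed on whatever scores each method produces.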
The closest study of this kind is perhaps Steinmayr et al. (2010). But it used non-public data and did not use all the available methods. For instance, it did not fit latent models with a g factor at all, only five primary factors (?!).
The above is not to say that using different samples does not alter results too. There are various ways of excluding participants, e.g. for handicaps (physical, mental, or both), which surely change both means and variances. Worse, most datasets cover only a rather small number of subtests (because they rely on commercial test batteries, which is also bad!), and these subtests were often picked to minimize gender differences (or at least to balance them out, which makes the total summed score useless for studying such differences).
It would be better if some super/master dataset could be collected with, say, 60 very different mental tests: elementary cognitive tests, Piagetian tasks (do these work on adults, or is the ceiling effect too strong?), matrices, vocabulary, number series, picture completion, analogies, digit span, learning tests, maze solving, and so on (see Jensen, 1980, Chapter 4, for inspiration), with a minimum of 3 tests per type so that a latent group/type factor can be estimated if one exists. Then all researchers could study the same dataset. This is how science should work: methods and data must be completely open.
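As a rough illustration of what such a master dataset would look like, and of why at least three tests per type are needed, here is a simulation sketch: every test loads on a general factor g and on one of 20 hypothetical type-specific group factors, with three tests per type. All loadings and counts are invented for illustration.

```python
# Simulation sketch of the proposed master dataset: a general factor g plus
# type-specific group factors, each measured by 3 tests (the minimum that
# identifies a group factor). All parameters are invented for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_people, n_types, tests_per_type = 1000, 20, 3      # 20 x 3 = 60 tests in total

g = rng.normal(size=n_people)                        # general factor scores
groups = rng.normal(size=(n_people, n_types))        # one group factor per test type

names, scores = [], []
for t in range(n_types):
    for i in range(tests_per_type):
        g_load, grp_load = 0.6, 0.4                  # illustrative loadings
        err_sd = np.sqrt(1 - g_load**2 - grp_load**2)
        scores.append(g_load * g + grp_load * groups[:, t]
                      + rng.normal(scale=err_sd, size=n_people))
        names.append(f"type{t + 1}_test{i + 1}")

master = pd.DataFrame(np.column_stack(scores), columns=names)
print(master.shape)   # (1000, 60): one row per person, 60 unit-variance subtest scores
```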
Refs
Nyborg, H. (2003). Sex differences in g. In H. Nyborg (Ed.), The scientific study of general intelligence (pp. 187-222).
Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence, 40(2), 73-76.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Steinmayr, R., Beauducel, A., & Spinath, B. (2010). Do sex differences in a faceted model of fluid and crystallized intelligence depend on the method applied? Intelligence, 38(1), 101-110.