I was reviewing a paper on social inequality in Brazil at the state level (I published one myself in 2015), and it struck me that one should be able to find data from the next level down. States of Brazil are huge. The 6th largest one is about the same size as Germany! So analyzing data at that level is bound to create some serious problems with within-unit heterogeneity. Fortunately, Brazil has a level below that, municipalities/municipals. There are some 5500 municipals, which makes them comparable to the same-level unit in the US, counties, of which there are about 3100 and which I have previously analyzed extensively.

Data?

But the question is: Can we get data from municipals? Can we get cognitive data? The paper used the well known PISA data, but I’ve never seen PISA data available at a level below first-level administrative divisions. PISA data are collected by sampling students, and while the OECD is rich, its spending on PISA is not quite at the level of what countries themselves spend on testing. PISA sampling methodology is somewhat opaque, but it seems that the sample sizes are in the area of ~5000 per country. If one wants data from 5000 political units, obviously this won’t do!

So, what can we do instead? Usually, countries keep check on their students’ achievement themselves, and this is especially true for countries in America in my experience. To use the US as an example, they have PISA data, but they publish data for only 3 states! However, one can get both state- and county-level cognitive data by using scores from non-international achievement tests. For instance, NAEP scores are readily available by state and even by state x SIRE (self-reported race/ethnicity), which turned out to be very useful. The US county-level data I used in my study were from:

State accountability test results in every grade are used to generate an average level of achievement in each district. That achievement level is adjusted once based on the extent to which the average achievement in that state, as measured by the National Assessment of Educational Progress (NAEP), is higher or lower than the national average. The district level is adjusted a second time based on the extent to which the U.S. does better or worse than students in a set of countries with developed economies, as measured by the Program for International Student Assessment (PISA).

Does Brazil have something similar? If we don’t read Portuguese at more than a very basic level, how can we find out?

Google-fu

We can find a lot of stuff on Google without knowing the local language. I’ve done a number of studies using data from French and Spanish sources already.

Cognitive data

To start with, we have to find the names of one or more national achievement tests used in the country. In this case, we google something like “achievement test Brazil”. The second hit (for me) was a paper talking about a Teste do Desempenho Escolar [TDE], i.e. School Achievement Test. This is good enough to get started. Then we start googling that name. We don’t do so at random. We can’t read most of the stuff we find, so we can’t easily judge the search results by their titles. However, we can filter by file type. What kind of files do we want? XLS, XLSX or maybe PDF. To make Google find these, use “filetype:xls OR filetype:xlsx” (no quotes). Researchers publish their findings in PDFs, often including tables, but PDF is not a good format for publishing large amounts of data. The data are usually in Excel format, for which there are two variants. So we download a lot of Excel files and look around.
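If you end up running many of these searches, it can help to build the query strings programmatically. Below is a minimal sketch in Python. The helper names `build_query` and `search_url` are my own inventions for illustration, and the URL pattern is just the ordinary Google search address with a `q` parameter. It only builds the strings; you still paste them into a browser.

```python
from urllib.parse import urlencode


def build_query(terms, filetypes=("xls", "xlsx"), phrase=None):
    """Combine search terms, an optional exact phrase and filetype filters.

    `terms` is free text (e.g. the name of a test), `phrase` is put in
    double quotes so it is matched exactly, and each entry in `filetypes`
    becomes a `filetype:` operator joined with OR.
    """
    parts = [terms]
    if phrase:
        parts.append(f'"{phrase}"')
    if filetypes:
        parts.append(" OR ".join(f"filetype:{ft}" for ft in filetypes))
    return " ".join(parts)


def search_url(query):
    """Turn a query string into a URL that can be opened in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query("Teste do Desempenho Escolar")
print(query)  # Teste do Desempenho Escolar filetype:xls OR filetype:xlsx
print(search_url(query))
```

The optional `phrase` argument is there for when the plain name returns too much noise and you want to require an exact phrase as well.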

I did not have too much luck with that test — at best I found some regional or state-level data — so we go back to Google. Here I found another paper with another test:

We analyze the impact of child labor on school achievement using Brazilian school achievement test data from the 2003 Sistema Nacional de Avaliação da Educação Básica (SAEB). We control for the endogeneity of child labor using instrumental variable techniques, where the instrumental variable is the average wage for unskilled male labor in the state. Using our preferred OLS estimates, we find that child labor causes a loss in students’ school achievement. Children and adolescents who do not work have better school performance than students who work. Up to two hours of work per day do not have a statistically significant effect on school performance, but additional hours decrease student’s achievement. Differences in work conditions affect school performance. For high school students in Portuguese, compared to students who have schooling as their only activity, students who work only at home score 4 percent lower on the tests. Those students who only work outside the house are worse off than those who only work within the house, with test scores decreasing by 5 percent. Students who work both inside and outside the house have the lowest test scores of all the working conditions, decreasing by up to 7 percent.

So, then we google that and look for XLS, XLSX again. I found a lot of stuff, but not so much data. So we need to be more specific. Since we want municipal data, we go to Google Translate and get a good phrase for “by municipal” in the local language, “por município”. I still did not find any good data, but there were lots of documents. Lots of documents means we are onto something but can’t seem to find the right thing. It’s a filtering problem, not a lack of content problem.

An idea is to try another name variant. Tests with long names usually get abbreviated. NAEP means National Assessment of Educational Progress. Is there an abbreviation of the Brazilian test? Yes, SAEB. So we google that, same approach as before. JACKPOT. First hit is a file with 10 data tables (tabela), one of which has “Prova Brasil (2005) – Proficiências – 4ª e 8ª série do Ensino Fundamental” meaning Proof Brazil (2005) – Proficiencies – 4th and 8th grade of Elementary School. There’s a column “Município”. How many rows? We’re looking for ~5500 because that’s the number of municipals according to Wikipedia. 5565. Great! The scores seem to be raw, not discretized into things like low/medium/high. This is good because discretized data are problematic/annoying to use. We also have the name, Prova Brasil — Test of Brazil — apparently the name of the test undertaking. It’s from 2005, so perhaps we can find newer data. And we can, Prova Brasil is available for many years (2005, 2007, 2009, 2011, 2013 and 2015 was planned?) and finding the data is easy now that we know what to search for. So now we’re at the point where we want to learn more about the test, so we google the name and pick the most governmenty website we can find and then translate that.
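The two checks I did by eye here (is the row count close to the expected number of units, and are the scores raw numbers rather than categories) are easy to automate once a table has been exported to rows. Here is a rough sketch, assuming the table has already been read into a list of dicts. The function name, the 5% tolerance, the 90% threshold, the column name “Proficiência” and the toy rows are all made up for illustration. The comma handling assumes the file uses decimal commas, which I have not verified for these particular files.

```python
def looks_like_municipal_table(rows, id_column, score_column,
                               expected_units=5565, tolerance=0.05):
    """Rough sanity check for a candidate data table.

    Returns True if the number of distinct units is within `tolerance`
    (as a fraction) of `expected_units` and more than 90% of the score
    values parse as numbers, i.e. the scores do not look discretized.
    """
    if not rows:
        return False

    def is_number(value):
        # Accept decimal commas ("183,5") as well as decimal points.
        try:
            float(str(value).replace(",", "."))
            return True
        except ValueError:
            return False

    n_units = len({row[id_column] for row in rows})
    count_ok = abs(n_units - expected_units) <= tolerance * expected_units
    numeric_share = sum(is_number(row[score_column]) for row in rows) / len(rows)
    return count_ok and numeric_share > 0.9


# Toy rows, not real data: four made-up units with raw vs. discretized scores.
raw = [{"Município": f"M{i}", "Proficiência": "180,5"} for i in range(4)]
binned = [{"Município": f"M{i}", "Proficiência": "baixo"} for i in range(4)]

print(looks_like_municipal_table(raw, "Município", "Proficiência", expected_units=4))     # True
print(looks_like_municipal_table(binned, "Município", "Proficiência", expected_units=4))  # False
```

A table that fails the count check is not necessarily useless. It may just be at a different level (states, schools) or from a different country.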

The Brazil Test and the National System for the Evaluation of Basic Education (Saeb) are large-scale diagnostic evaluations developed by the National Institute of Educational Studies and Research Anísio Teixeira (Inep / MEC). They aim to evaluate the quality of the education offered by the Brazilian educational system based on standardized tests and socioeconomic questionnaires.

In the tests applied in the fourth and eighth grades of elementary education, students respond to Portuguese-language items (questions), focusing on reading, and mathematics, with a focus on problem solving. In the socioeconomic questionnaire, students provide information about context factors that may be associated with performance.

Teachers and class directors and schools also respond to questionnaires that collect demographic data, professional profile and working conditions.

Based on information provided by Saeb and Prova Brasil, the MEC and the state and municipal education secretariats can define actions aimed at improving the quality of education in the country and reducing existing inequalities, promoting, for example, the correction of distortions and Weaknesses identified and directing their technical and financial resources to areas identified as priorities.

The performance averages in these evaluations also subsidize the calculation of the Basic Education Development Index (Ideb), along with the approval rates in these areas.

In addition, the data are also available to the whole society, which, from the results, can follow the policies implemented by different spheres of government. In the case of Prova Brasil, we can still observe the specific performance of each teaching network and the system as a whole of the country’s urban and rural public schools.

This is more or less perfect. Wish other countries had such great publicly available data!

Other data

Cognitive data is the hardest to find. Since we know we can find that, we can probably find all kinds of other data. The search method is easy: just use the by municipal phrase, “por município”, and look for Excel files. Usually socioeconomic data use names that are somewhat recognizable in Romance languages. Even a small language similarity goes a long way with words that rarely change! The first useful file to come up was a population breakdown by age and sex. 12000 rows. [ETA: That file is actually from Mexico, which apparently also has municipals as the second-level unit! Thanks to the person who pointed out the error. Maybe someone can find some matching cognitive data and we have another dataset to use.]

At this point, all we have to do is google translate some approximate names of the variables we want (poverty, income, race/color, health, obesity, crime, arrest, sentences), add the by municipal phrase and search for Excel files. After that comes the really arduous task, namely combining all that messy data. That’s a topic for another time, but basically the solution is to find or make some kind of unique ID for each municipal and use that to merge.
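To make the merge idea concrete, here is a minimal sketch using only the Python standard library. It assumes each file has already been read into a list of dicts and that every row carries the same ID column. The function names, the `id` column and the toy values are hypothetical, not taken from the actual files.

```python
def merge_by_id(tables, id_column="id"):
    """Merge several tables (lists of dicts) into one record per unit ID.

    IDs are converted to stripped strings so that " 1" and 1 end up as the
    same key. Later tables overwrite earlier ones on column name clashes.
    """
    merged = {}
    for table in tables:
        for row in table:
            key = str(row[id_column]).strip()
            record = merged.setdefault(key, {})
            for column, value in row.items():
                if column != id_column:
                    record[column] = value
    return merged


def incomplete_units(merged, required_columns):
    """List the IDs that are missing at least one of the required columns."""
    return sorted(
        key for key, record in merged.items()
        if not all(column in record for column in required_columns)
    )


# Toy tables, not real data.
scores = [{"id": "1", "score": 200.0}, {"id": "2", "score": 180.0}]
income = [{"id": " 1", "income": 900}, {"id": "3", "income": 700}]

merged = merge_by_id([scores, income])
print(merged["1"])                                    # {'score': 200.0, 'income': 900}
print(incomplete_units(merged, ["score", "income"]))  # ['2', '3']
```

Merging on an ID rather than on the municipal name matters because names can repeat across states and their spelling tends to vary between files. Checking which units are incomplete after the merge is a quick way to spot files that use a different ID scheme.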

Data sharing

I did some of the above but I really, really don’t have time to analyze it right now. So, I put all the stuff I found on OSF. Maybe someone else can take a stab at it.