Assignment 2: Data Management#
Introduction#
In this assignment you will work on a small but realistic data management project. As always, you will collaborate via git.
To avoid disappointments, here are a few rules for all tasks:
Write good commit messages and commit frequently. Use git for the entire process. Do not hesitate to commit unfinished or broken code. Git should be the only way to share code with your peers.
You only get points if you contribute. If you don’t commit at all or you only commit trivial stuff (like fixing a typo in a comment), you will not get points, even if your group provides a good solution.
All functions need docstrings
Functions must not have side effects on inputs
Never overwrite the original data
Do not commit generated files (e.g. cleaned datasets)
Follow the rules for working with file paths
Use the “modern pandas” settings for all exercises
The entire solution has to be in .py files. If you find it easier, you can prototype
some of the functions in jupyter notebooks. In that case, it is a good idea if each
group member has their own notebook, so you do not get merge conflicts.
The deadline is 24 November, 11:59 pm#
Background#
In this assignment you will do data management for the paper Estimating the Technology of Cognitive and Noncognitive Skill Formation by Cunha, Heckman and Schennach (CHS), Econometrica, 2010.
Doing the complete data management of such a complicated project is not possible in one assignment (it often takes weeks or months). Therefore, you will only work with a small subset of the variables needed to replicate the paper. Moreover, we will save you some of the most painful steps by providing a pre-processed version of the dataset and csv files that will help you to harmonize variable names between panel waves.
We will focus on the Behavior Problem Index that is used to measure non-cognitive skills. This index has the subscales antisocial behavior, anxiety, dependence, headstrong, hyperactive and peer problems. Here is an overview.
The assignment repository contains a file called src/original_data/original_data.zip,
with four files in it:
BEHAVIOR_PROBLEMS_INDEX.dta: Contains the main data you will work with. It is in wide format and the variable names are not informative. Moreover, the names do not contain the survey year in which the question was asked.
bpi_variable_info.csv: Contains information that will help you to decompose the main dataset into datasets for each year and to rename the variables such that the same questions get the same name across periods. In a real project you would have to generate this information yourself.
BEHAVIOR_PROBLEMS_INDEX.cdb: The codebook of the dataset. If you have any questions about the data, the answers are probably in the codebook.
chs_data.dta: The data file used in the original paper by Cunha, Heckman and Schennach.
The chs_data is taken from the online appendix of the paper. bpi_variable_info.csv was created by us. The other files were downloaded using the NLS Investigator.
Task 1#
Follow this link, create the repository for your group and clone it to your computers.
Task 2#
For this task you will work in unzip.py and store the results in the bld directory.
Use pathlib to check if the bld directory exists and create it otherwise.
Modify your .gitignore file to make sure that all files in the bld directory are
ignored.
Unzip the file src/original_data/original_data.zip. This stackoverflow post tells you how. Since the unzipped files are generated, they should not be under version control.
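The steps above can be sketched as follows; this is a minimal version of unzip.py, assuming the paths from the assignment layout (adjust if your repository differs):

```python
import zipfile
from pathlib import Path

SRC = Path("src") / "original_data" / "original_data.zip"
BLD = Path("bld")


def unzip_original_data(src=SRC, bld=BLD):
    """Extract the original data archive into the bld directory."""
    # Create bld/ if it does not exist yet; do nothing if it does.
    bld.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(src) as zf:
        zf.extractall(bld)


if __name__ == "__main__":
    # Guard so the sketch also runs in environments without the data.
    if SRC.exists():
        unzip_original_data()
```

Remember to add a line like `bld/` to your .gitignore so none of the extracted files end up under version control.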
Task 3#
For this task you will work in clean_chs_data.py
Implement the function clean_chs_data. The function should take a DataFrame and return
a DataFrame.
Use as many helper functions as you need. Give them good names and mark them as helper functions by starting the name with an underscore.
The cleaned data should contain the following columns:
Cleaned versions of the behavioral problems index (bpi) variables with sensible names. “Cleaned version” means that missings are coded as pandas missings (NA or NaN) instead of some numerical values with special meaning. The mapping of sensible to raw names is as follows:
| Sensible name | Raw name |
|---|---|
| bpi_antisocial_chs | bpiA |
| bpi_anxiety_chs | bpiB |
| bpi_headstrong_chs | bpiC |
| bpi_hyperactive_chs | bpiD |
| bpi_peer_chs | bpiE |
The column “momid”, based on the original column of the same name. Choose a suitable dtype.
“age”, as in the raw data. This can be an integer, as it was discretized to two-year bins.
Set the index to [“childid”, “year”] and choose suitable dtypes for both index variables.
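A minimal sketch of this cleaning step, assuming the raw columns are named as in the table above and that missings are coded as negative numbers (check the codebook for the actual codes):

```python
import pandas as pd

# Mapping of raw to sensible names, taken from the assignment table.
RENAME_MAP = {
    "bpiA": "bpi_antisocial_chs",
    "bpiB": "bpi_anxiety_chs",
    "bpiC": "bpi_headstrong_chs",
    "bpiD": "bpi_hyperactive_chs",
    "bpiE": "bpi_peer_chs",
}


def clean_chs_data(raw):
    """Return a cleaned copy of the raw CHS data (no side effects on raw)."""
    df = raw.rename(columns=RENAME_MAP)  # rename returns a copy
    bpi_cols = list(RENAME_MAP.values())
    # Replace negative missing codes with proper pandas missings.
    df[bpi_cols] = df[bpi_cols].where(df[bpi_cols] >= 0)
    df["momid"] = df["momid"].astype("Int64")  # nullable integer dtype
    df["age"] = df["age"].astype("Int64")
    return df.set_index(["childid", "year"])
```

The `.where` trick turns every value failing the condition into a missing; if the codebook defines several distinct missing codes, a `replace` with an explicit mapping is the more careful alternative.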
At the bottom of the .py file you will find an if __name__ == '__main__' clause. If you want, you can watch a video on why it is necessary.
Add the code suggested by the comments inside the if condition.
For those who watched the video: You do not have to put everything you do inside the condition into a main function. Putting multiple function calls into the if condition is completely fine.
Do not forget to write docstrings for all functions. A one-liner is enough for helper functions.
Task 4#
For this task you will work in clean_nlsy_data.py.
Prelude#
You will eventually implement the function _clean_one_wave. The function should have
the following properties:
It takes three arguments:
The entire raw NLSY dataset
A year (between 1986 and 2010, both inclusive and only even year numbers)
The bpi variable info (as a DataFrame).
It returns a DataFrame with the cleaned data of the requested year.
The cleaned data should have the same index as chs_data and it should contain the
following variables:
Clean versions of all variables that make up the BPI. They have an ordered categorical dtype with the values not true < sometimes true < often true. The variables have the names indicated in the bpi_variable_info file. Missings are coded as pandas missings (NA or NaN) instead of negative numbers.
Note: The dataset contains surprises, such as variables that take values you did not expect or category labels that are similar but not identical across variables. Try to find good solutions for each of them.
Scores for each subscale of the behavioral problems index, which are calculated by averaging the items of that subscale.
antisocial
anxiety
headstrong
hyperactive
peer
dependence
Note: Before averaging, the categorical variables have to be converted to numbers. For this, the answers sometimes true and often true are counted as 1; not true is counted as 0.
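The two conversions described above can be sketched like this; the helper names are illustrative, and the sketch assumes the item columns already hold the harmonized string answers:

```python
import pandas as pd

CATEGORIES = ["not true", "sometimes true", "often true"]
# Scoring rule from the note above: sometimes/often true -> 1, not true -> 0.
SCORE_MAP = {"not true": 0, "sometimes true": 1, "often true": 1}


def to_ordered_categorical(sr):
    """Convert an answer column to the ordered categorical dtype."""
    dtype = pd.CategoricalDtype(categories=CATEGORIES, ordered=True)
    return sr.astype(dtype)  # values outside the three categories become missing


def subscale_score(df, items):
    """Average the items of one subscale after mapping answers to 0/1."""
    # astype(object) sidesteps categorical-specific map behavior;
    # missings are not in SCORE_MAP and therefore stay missing.
    numeric = df[items].apply(lambda sr: sr.astype(object).map(SCORE_MAP))
    return numeric.mean(axis=1)  # mean skips missings by default
```

Because unexpected values are silently turned into missings by the dtype conversion, it is worth checking the raw value counts of each item first, so genuine data surprises do not disappear unnoticed.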
Do not forget to write docstrings for all functions. A one-liner is enough for helper functions.
Task 4a#
Think about the structure of the function _clean_one_wave. What are the relative
merits of using the metadata programmatically vs. hardcoding the names of the variables?
Discuss your answer in a couple of bullet points in the README.md file.
Task 4b#
Implement the function _clean_one_wave.
Task 5#
Continue to work in clean_nlsy_data.py.
Implement the function manage_nlsy_data. This function has the following properties:
It takes two arguments:
The entire raw NLSY dataset
The bpi variable info (as a DataFrame).
It returns a DataFrame with the cleaned data of all waves.
In terms of implementation, it should:
call _clean_one_wave to create a list of cleaned yearly datasets
concatenate the list of cleaned yearly datasets into one DataFrame in long format
only keep the data for even years between 1986 and 2010
Add data loading and function calls in the if __name__ == '__main__' condition.
Task 6#
For this task you will work in merge.py.
Merge the clean chs data with the clean nlsy data. Only keep observations that are present in both DataFrames. Before you merge, check that there are no overlaps in column names between the two datasets.
Restrict the merged dataset to the age groups 5 to 13 (both inclusive).
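A minimal sketch of the merge, assuming both inputs are indexed by (childid, year); the function name is illustrative:

```python
import pandas as pd


def merge_data(chs, nlsy):
    """Inner-merge the cleaned CHS and NLSY data and restrict to ages 5-13."""
    # Check for overlapping column names before merging.
    overlap = chs.columns.intersection(nlsy.columns)
    if not overlap.empty:
        raise ValueError(f"Overlapping columns: {list(overlap)}")
    # Inner join keeps only observations present in both DataFrames.
    merged = chs.merge(nlsy, left_index=True, right_index=True, how="inner")
    return merged.query("5 <= age <= 13")
```

Raising on overlapping columns (rather than letting pandas append `_x`/`_y` suffixes) makes a naming mistake in the cleaning steps fail loudly instead of silently duplicating variables.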
Use an if __name__ == '__main__' condition for all function calls.
Task 7#
For this task you will work in plot.py
You will plot your scores against the ones in the chs data – for each score and age.
Make a grid of regression plots for each score that show how your score relates to the corresponding score in the chs data. Each grid contains 5 subplots, one for each age group.
Note: Making such grids is really easy in plotly. Search for the word facet in the
documentation of
px.scatter
The names of their scores relate to your names as follows:
The dependence scale has no counterpart in the chs data. If you did everything correctly you should see a perfectly negative correlation for some variables and a strong but not perfectly negative correlation for other variables.
Save the plots under suitable file names in the bld folder. .png is the preferred format. If you are on Windows and have trouble exporting static plotly plots (even with the workaround), you can switch to .html instead.
Use an if __name__ == '__main__' condition for all function calls.