Assignment 2: Data Management#
Introduction#
In this assignment you will work on a small but realistic data management project. As always, you will collaborate via git.
To avoid disappointments, here are a few rules for all tasks:
Write good commit messages and commit frequently. Use git for the entire process. Do not hesitate to commit unfinished or broken code. Git should be the only way to share code with your peers.
You only get points if you contribute. If you don’t commit at all or you only commit trivial stuff (like fixing a typo in a comment), you will not get points, even if your group provides a good solution.
All functions need docstrings
Functions must not have side effects on inputs
Never overwrite the original data
Do not commit generated files (e.g. cleaned datasets)
Follow the rules for working with file paths
Use the “modern pandas” settings for all exercises
The entire solution has to be in .py files. If you find it easier, you can prototype
some of the functions in jupyter notebooks. In that case, it is a good idea if each
group member has their own notebook, so you do not get merge conflicts.
The deadline is 24 November, 11:59 pm#
Background#
In this assignment you will do data management for the paper Estimating the Technology of Cognitive and Noncognitive Skill Formation by Cunha, Heckman and Schennach (CHS), Econometrica, 2010.
Doing the complete data management of such a complicated project is not possible in one assignment (it often takes weeks or months). Therefore, you will only work with a small subset of the variables needed to replicate the paper. Moreover, we will save you some of the most painful steps by providing a pre-processed version of the dataset and csv files that will help you to harmonize variable names between panel waves.
We will focus on the Behavior Problem Index that is used to measure non-cognitive skills. This index has the subscales antisocial behavior, anxiety, dependence, headstrong, hyperactive and peer problems. Here is an overview.
The assignment repository contains a file called src/original_data/original_data.zip,
with four files in it:
BEHAVIOR_PROBLEMS_INDEX.dta: Contains the main data you will work with. It is in wide format and the variable names are not informative. Moreover, the names do not contain the survey year in which the question was asked.
bpi_variable_info.csv: Contains information that will help you to decompose the main dataset into datasets for each year and to rename the variables such that the same questions get the same name across periods. In a real project you would have to generate this information yourself.
BEHAVIOR_PROBLEMS_INDEX.cdb: The codebook of the dataset. If you have any questions about the data, the answers are probably in the codebook.
chs_data.dta: The data file used in the original paper by Cunha, Heckman and Schennach.
The chs_data is taken from the online appendix of the paper. bpi_variable_info.csv was created by us. The other files were downloaded using the NLS Investigator.
Task 1#
Follow this link, create the repository for your group and clone it to your computers.
Task 2#
For this task you will work in unzip.py and store the results in the bld directory.
Use pathlib to check if the bld directory exists and create it otherwise.
Modify your .gitignore file to make sure that all files in the bld directory are
ignored.
Unzip the file src/original_data/original_data.zip. This stackoverflow post tells you how. Since the unzipped files are generated, they should not be under version control.
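The steps above can be sketched as follows; this is a minimal version of unzip.py, assuming the paths from the assignment layout (adjust if your repository differs):

```python
import zipfile
from pathlib import Path

SRC = Path("src") / "original_data" / "original_data.zip"
BLD = Path("bld")


def unzip_original_data(src=SRC, bld=BLD):
    """Extract the original data archive into the bld directory."""
    # Create bld/ if it does not exist yet; do nothing if it does.
    bld.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(src) as zf:
        zf.extractall(bld)


if __name__ == "__main__":
    # Guard so the sketch also runs in environments without the data.
    if SRC.exists():
        unzip_original_data()
```

Remember to add a line like `bld/` to your .gitignore so none of the extracted files end up under version control.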
Task 3#
For this task you will work in clean_chs_data.py
Implement the function clean_chs_data. The function should take a DataFrame and return
a DataFrame.
Use as many helper functions as you need. Give them good names and mark them as helper functions by starting the name with an underscore.
The cleaned data should contain the following columns:
Cleaned versions of the behavioral problems index (bpi) variables with sensible names. “Cleaned version” means that missings are coded as pandas missings (NA or NaN) instead of some numerical values with special meaning. The mapping of sensible to raw names is as follows:
| Sensible name | Raw name |
|---|---|
| bpi_antisocial_chs | bpiA |
| bpi_anxiety_chs | bpiB |
| bpi_headstrong_chs | bpiC |
| bpi_hyperactive_chs | bpiD |
| bpi_peer_chs | bpiE |
The column “momid”, based on the original column of the same name. Choose a suitable dtype.
“age”, as in the raw data. This can be an integer, as it was discretized to two-year bins.
Set the index to [“childid”, “year”] and choose suitable dtypes for both index variables.
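A minimal sketch of this cleaning step, assuming the raw columns are named as in the table above and that missings are coded as negative numbers (check the codebook for the actual codes):

```python
import pandas as pd

# Mapping of raw to sensible names, taken from the assignment table.
RENAME_MAP = {
    "bpiA": "bpi_antisocial_chs",
    "bpiB": "bpi_anxiety_chs",
    "bpiC": "bpi_headstrong_chs",
    "bpiD": "bpi_hyperactive_chs",
    "bpiE": "bpi_peer_chs",
}


def clean_chs_data(raw):
    """Return a cleaned copy of the raw CHS data (no side effects on raw)."""
    df = raw.rename(columns=RENAME_MAP)  # rename returns a copy
    bpi_cols = list(RENAME_MAP.values())
    # Replace negative missing codes with proper pandas missings.
    df[bpi_cols] = df[bpi_cols].where(df[bpi_cols] >= 0)
    df["momid"] = df["momid"].astype("Int64")  # nullable integer dtype
    df["age"] = df["age"].astype("Int64")
    return df.set_index(["childid", "year"])
```

The `.where` trick turns every value failing the condition into a missing; if the codebook defines several distinct missing codes, a `replace` with an explicit mapping is the more careful alternative.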
At the bottom of the .py file you will find an if __name__ == '__main__' clause. If you want, you can watch a video on why it is necessary.
Add the code suggested by the comments inside the if condition.
For those who watched the video: You do not have to put everything you do inside the condition into a main function. Putting multiple function calls into the if condition is completely fine.
Do not forget to write docstrings for all functions. A one-liner is enough for helper functions.
Task 4#
For this task you will work in clean_nlsy_data.py.
Prelude#
You will eventually implement the function _clean_one_wave. The function should have
the following properties:
It takes three arguments:
The entire raw NLSY dataset
A year (between 1986 and 2010, both inclusive and only even year numbers)
The bpi variable info (as a DataFrame).
It returns a DataFrame with the cleaned data of the requested year.
The cleaned data should have the same index as chs_data and it should contain the
following variables:
Clean versions of all variables that make up the BPI. They have an ordered categorical dtype with the values not true < sometimes true < often true. The variables have the names indicated in the bpi_variable_info file. Missings are coded as pandas missings (NA or NaN) instead of negative numbers.
Note: The dataset contains surprises, such as variables that take values you did not expect or category labels that are similar but not identical across variables. Try to find good solutions for each of them.
Scores for each subscale of the behavioral problems index, which are calculated by averaging the items of that subscale.
antisocial
anxiety
headstrong
hyperactive
peer
dependence
Note: Before averaging, the categorical variables have to be converted to numbers. For this, the answers sometimes true and often true are counted as 1; not true is counted as 0.
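The two conversions described above can be sketched like this; the helper names are illustrative, and the sketch assumes the item columns already hold the harmonized string answers:

```python
import pandas as pd

CATEGORIES = ["not true", "sometimes true", "often true"]
# Scoring rule from the note above: sometimes/often true -> 1, not true -> 0.
SCORE_MAP = {"not true": 0, "sometimes true": 1, "often true": 1}


def to_ordered_categorical(sr):
    """Convert an answer column to the ordered categorical dtype."""
    dtype = pd.CategoricalDtype(categories=CATEGORIES, ordered=True)
    return sr.astype(dtype)  # values outside the three categories become missing


def subscale_score(df, items):
    """Average the items of one subscale after mapping answers to 0/1."""
    # astype(object) sidesteps categorical-specific map behavior;
    # missings are not in SCORE_MAP and therefore stay missing.
    numeric = df[items].apply(lambda sr: sr.astype(object).map(SCORE_MAP))
    return numeric.mean(axis=1)  # mean skips missings by default
```

Because unexpected values are silently turned into missings by the dtype conversion, it is worth checking the raw value counts of each item first, so genuine data surprises do not disappear unnoticed.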
Do not forget to write docstrings for all functions. A one-liner is enough for helper functions.
Task 4a#
Think about the structure of the function _clean_one_wave. What are the relative
merits of using the metadata programmatically vs. hardcoding the names of the variables?
Discuss your answer in a couple of bullet points in the README.md file.
Task 4b#
Implement the function _clean_one_wave.
Task 5#
Continue to work in clean_nlsy_data.py.
Implement the function manage_nlsy_data. This function has the following properties:
It takes two arguments:
The entire raw NLSY dataset
The bpi variable info (as a DataFrame).
It returns a DataFrame with the cleaned data of all waves.
In terms of implementation, it should:
call _clean_one_wave to create a list of cleaned yearly datasets
concatenate the list of cleaned yearly datasets into one DataFrame in long format
only keep the data for even years between 1986 and 2010
Add data loading and function calls in the if __name__ == '__main__' condition.
Task 6#
For this task you will work in merge.py.
Merge the clean chs data with the clean nlsy data. Only keep observations that are present in both DataFrames. Before you merge, check that there are no overlaps in column names between the two datasets.
Restrict the merged dataset to the age groups 5 to 13 (both inclusive).
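A minimal sketch of the merge, assuming both inputs are indexed by (childid, year); the function name is illustrative:

```python
import pandas as pd


def merge_data(chs, nlsy):
    """Inner-merge the cleaned CHS and NLSY data and restrict to ages 5-13."""
    # Check for overlapping column names before merging.
    overlap = chs.columns.intersection(nlsy.columns)
    if not overlap.empty:
        raise ValueError(f"Overlapping columns: {list(overlap)}")
    # Inner join keeps only observations present in both DataFrames.
    merged = chs.merge(nlsy, left_index=True, right_index=True, how="inner")
    return merged.query("5 <= age <= 13")
```

Raising on overlapping columns (rather than letting pandas append `_x`/`_y` suffixes) makes a naming mistake in the cleaning steps fail loudly instead of silently duplicating variables.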
Use an if __name__ == '__main__' condition for all function calls.
Task 7#
For this task you will work in plot.py
You will plot your scores against the ones in the chs data – for each score and age.
Make a grid of regression plots for each score that show how your score relates to the corresponding score in the chs data. Each grid contains 5 subplots, one for each age group.
Note: Making such grids is really easy in plotly. Search for the word facet in the
documentation of
px.scatter
The names of their scores relate to your names as follows:
The dependence scale has no counterpart in the chs data. If you did everything correctly you should see a perfectly negative correlation for some variables and a strong but not perfectly negative correlation for other variables.
Save the plots under suitable file names in the bld folder. .png is the preferred format. If you are on Windows and have trouble exporting static plotly plots (even with the workaround), you can switch to .html instead.
Use an if __name__ == '__main__' condition for all function calls.