Supplementary — White by Default: Bias in Criminal Racial Assignment

Downloads

Source Code: View on GitHub

Open Science Framework: Data, paper, and replication materials

Appendix

Mathematical Formulation of Bias for Simulation

In each bias scenario, we reassigned 10% of Green individuals to the Blue class. The scenarios differ only in the mechanism used to select which individuals were reassigned.

Let the Blue centroid be represented by coordinates $(c_x, c_y)$ in our two-dimensional feature space. For each Green point $g_i$ with coordinates $(x_i, y_i)$, we calculated the Euclidean distance to the Blue centroid:

$$d_i = \sqrt{(x_i - c_x)^2 + (y_i - c_y)^2}$$

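The bias assignment probabilities were then defined as exponential functions of the distance $d_i$, which produce opposite selection patterns for the Strategic and Obvious bias types. These values are unnormalized, so they act as relative sampling weights rather than probabilities that sum to one:

$$P_{\text{strategic}}(i) = \exp(-d_i)$$

$$P_{\text{obvious}}(i) = \exp(d_i)$$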

These formulations ensure that Strategic bias preferentially selects Green individuals closest to the Blue centroid (higher probability for smaller distances), while Obvious bias preferentially selects those most distant from the Blue centroid (higher probability for larger distances).

We then selected the individuals to reassign by weighted sampling, with higher values more likely to be drawn, to produce the final simulated dataset for each scenario:

Strategic bias: weights proportional to $P_{\text{strategic}}(i) = \exp(-d_i)$
Obvious bias: weights proportional to $P_{\text{obvious}}(i) = \exp(d_i)$
Random bias: uniform weights (weights = 1 for all individuals)
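
The selection step above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the paper: the function names (`centroid`, `bias_weights`, `sample_without_replacement`, `reassign`) are hypothetical, and it assumes that sampling is done without replacement, so that exactly 10% of distinct Green points are chosen. The source specifies only "weighted sampling".

```python
import math
import random


def centroid(points):
    """Mean (x, y) of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def bias_weights(green, blue_centroid, kind):
    """Unnormalized sampling weight for each Green point."""
    weights = []
    for point in green:
        d = math.dist(point, blue_centroid)  # Euclidean distance d_i
        if kind == "strategic":
            weights.append(math.exp(-d))  # favors points near the Blue centroid
        elif kind == "obvious":
            weights.append(math.exp(d))   # favors points far from the Blue centroid
        elif kind == "random":
            weights.append(1.0)           # uniform weights
        else:
            raise ValueError(f"unknown bias type: {kind}")
    return weights


def sample_without_replacement(weights, k, rng):
    """Draw k distinct indices, each draw proportional to the remaining weights.

    Assumption: the source does not say whether sampling was with or
    without replacement; without replacement is used here.
    """
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        current = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def reassign(green, blue, kind, fraction=0.10, seed=0):
    """Return sorted indices of Green points to relabel as Blue."""
    rng = random.Random(seed)
    k = round(fraction * len(green))
    weights = bias_weights(green, centroid(blue), kind)
    return sorted(sample_without_replacement(weights, k, rng))
```

Calling `reassign(green, blue, "strategic")` returns the indices of the Green points that would be relabeled. Passing `"obvious"` or `"random"` changes only the weights, which matches the description that the scenarios differ in selection mechanism alone.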

Model Training and Evaluation on Simulations

Following bias introduction, we trained multinomial logistic regression models on each simulated dataset using the simple formula race ~ x + y, where x and y represent the two-dimensional coordinates. This straightforward approach mirrors our linear modeling strategy for the real-world data while maintaining interpretive clarity.
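
To make the model concrete, the sketch below fits a three-class multinomial logistic regression on two coordinates, which is the structure implied by race ~ x + y. It is a stand-in for illustration only: the fitting routine is plain batch gradient descent written from scratch, not the software used for the paper, and the names `softmax`, `fit_softmax`, `predict`, and `make_clusters`, together with the cluster centers, learning rate, and epoch count, are all assumptions.

```python
import math
import random


def softmax(scores):
    """Convert a list of class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_softmax(points, labels, n_classes=3, lr=0.1, epochs=300):
    """Fit per-class [intercept, coef_x, coef_y] by batch gradient descent."""
    params = [[0.0, 0.0, 0.0] for _ in range(n_classes)]
    n = len(points)
    for _ in range(epochs):
        grads = [[0.0, 0.0, 0.0] for _ in range(n_classes)]
        for (x, y), label in zip(points, labels):
            feats = (1.0, x, y)
            scores = [sum(p * f for p, f in zip(row, feats)) for row in params]
            probs = softmax(scores)
            for k in range(n_classes):
                # Gradient of the negative log-likelihood for class k
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(3):
                    grads[k][j] += err * feats[j]
        for k in range(n_classes):
            for j in range(3):
                params[k][j] -= lr * grads[k][j] / n
    return params


def predict(params, point):
    """Return the index of the class with the highest linear score."""
    feats = (1.0, point[0], point[1])
    scores = [sum(p * f for p, f in zip(row, feats)) for row in params]
    return max(range(len(scores)), key=scores.__getitem__)


def make_clusters(seed=0, n_per_class=40):
    """Hypothetical toy data: three Gaussian clusters labeled 0, 1, 2."""
    rng = random.Random(seed)
    centers = [(-2.0, -1.0), (2.0, -1.0), (0.0, 2.0)]
    points, labels = [], []
    for k, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
            labels.append(k)
    return points, labels
```

Because the model is linear in x and y, the decision boundaries between groups are straight lines in the feature space, which keeps the effect of each bias type easy to read off the fitted regions.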

To address class imbalances created by the reassignment process, we implemented inverse frequency weighting:

$$w_j = \frac{N_{\text{total}}}{3 \times N_j}$$

where $w_j$ is the weight for class $j$, $N_{\text{total}}$ is the total sample size, $N_j$ is the number of observations in class $j$ after bias introduction, and 3 is the number of classes. Under this scheme every class carries the same total weight, so the models optimize for balanced performance across all three groups. The weighting corrects for the imbalance produced by reassignment and is consistent with the method used on the real dataset.
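
A short worked example may help show what the weighting does. The sketch below is illustrative only: the function name `inverse_frequency_weights`, the group sizes, and the label "Other" for the third group are assumptions, not values from the study. The number of classes is read from the data, which equals 3 in this setting.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: N_total / (n_classes * N_class)} for a list of labels."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)  # 3 in the simulations described here
    return {cls: n_total / (n_classes * n) for cls, n in counts.items()}


# Hypothetical sizes: 100 per group before bias, then 10 Greens relabeled as Blue.
labels = ["Green"] * 90 + ["Blue"] * 110 + ["Other"] * 100
weights = inverse_frequency_weights(labels)

# Each class contributes N_total / 3 of the total weight after reweighting.
class_totals = {cls: weights[cls] * labels.count(cls) for cls in weights}
```

Here the shrunken Green class gets a weight of 300 / (3 × 90) ≈ 1.11, the enlarged Blue class gets 300 / (3 × 110) ≈ 0.91, and the untouched group gets exactly 1. The same quantity is what scikit-learn's `class_weight="balanced"` option computes, namely `n_samples / (n_classes * np.bincount(y))`.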

The simulation process generates four distinct datasets: the original unbiased dataset plus three variants incorporating Random, Strategic, and Obvious bias, respectively. The figure below visualizes the four scenarios and shows how each bias type distorts the group assignment pattern in a characteristic way.

[Figure: plot_bias_comparison]
