Purpose
Apply Stanford’s Open Policing methodology on Mpls Police Data.
https://openpolicing.stanford.edu/
They have an R script for Saint Paul but not Mpls.
Description
Data Sources
Minneapolis Police Stop Data: https://opendata.minneapolismn.gov/datasets/police-stop-data
Other Notes
Change log
## Load libraries
# read
library(tidycensus); census_api_key("0c5d9fddc67544f23d3fa524660c9d81d4be2c90") # add API key here
# wrangle
library(tidyverse)
library(lubridate) # work with dates and times
# viz
library(janitor) # clean names and make nice tables
library(gghighlight)
library(ggthemes)
library(gt) # nice tables
# model
library(sf) # GIS
# export
## Custom functions
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Get police stop data.
Use an API link (ArcGIS) to directly download the latest geojson using st_read
Looking at the first rows of the data:
stop <- st_read("https://opendata.arcgis.com/datasets/215b4b543d894750aef86c725b56ee2a_0.geojson")
## Reading layer `Police_Stop_Data' from data source `https://opendata.arcgis.com/datasets/215b4b543d894750aef86c725b56ee2a_0.geojson' using driver `GeoJSON'
## Simple feature collection with 171935 features and 19 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -93.32915 ymin: 0 xmax: 0 ymax: 45.05124
## geographic CRS: WGS 84
head(stop)
## Simple feature collection with 6 features and 19 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -93.28807 ymin: 44.90617 xmax: -93.24741 ymax: 45.02484
## geographic CRS: WGS 84
## OBJECTID masterIncidentNumber responseDate reason
## 1 1 16-395258 2016-10-31 17:40:47
## 2 2 16-395296 2016-10-31 18:06:36
## 3 3 16-395326 2016-10-31 18:20:54
## 4 4 16-395328 2016-10-31 18:23:20
## 5 5 16-395333 2016-10-31 18:26:05
## 6 6 16-395364 2016-10-31 18:45:31 <NA>
## problem callDisposition citationIssued personSearch
## 1 Suspicious Person (P) BKG-Booking YES
## 2 Traffic Law Enforcement (P) TAG-Tagged NO
## 3 Attempt Pick-Up (P) RFD-Refused NO
## 4 Suspicious Person (P) BKG-Booking YES
## 5 Suspicious Vehicle (P) GOA-Gone on Arrival NO
## 6 Traffic Law Enforcement (P) ADV-Advised <NA> <NA>
## vehicleSearch preRace race gender lat long x y
## 1 NO Black Black Male 44.97957 -93.27257 -10383055 5618306
## 2 NO Unknown Black Male 44.96269 -93.27592 -10383428 5615650
## 3 NO Unknown Unknown Unknown 45.02484 -93.28807 -10384780 5625432
## 4 NO Black Black Male 44.94656 -93.24741 -10380254 5613112
## 5 NO Other Unknown Male 44.90617 -93.25501 -10381100 5606762
## 6 <NA> <NA> <NA> <NA> 44.93954 -93.27791 -10383650 5612008
## policePrecinct neighborhood lastUpdateDate
## 1 1 Downtown West 2017-08-08 05:25:31
## 2 5 Steven's Square - Loring Heights 2017-08-08 05:26:13
## 3 4 Webber - Camden 2017-08-08 05:24:35
## 4 3 Corcoran 2017-08-08 05:25:31
## 5 3 Hale 2017-08-08 05:25:03
## 6 5 Lyndale 2017-08-08 05:25:31
## geometry
## 1 POINT (-93.27257 44.97957)
## 2 POINT (-93.27592 44.96269)
## 3 POINT (-93.28807 45.02484)
## 4 POINT (-93.24741 44.94656)
## 5 POINT (-93.25501 44.90617)
## 6 POINT (-93.27791 44.93954)
How many total stops? Distinct count of master incident numbers.
n_distinct(stop$masterIncidentNumber)
## [1] 171935
What’s the date range?
min(stop$responseDate)
## [1] "2016-10-31 17:40:47 CDT"
max(stop$responseDate)
## [1] "2021-01-12 21:26:47 CST"
Let’s focus on 2016-2020. Convert from sf to a tibble - faster to work with, and can join geometry back later on.
stop2 <- stop %>%
filter(year(responseDate) >= 2017,
year(responseDate) < 2021) %>%
as_tibble() %>%
select(-geometry)
For now, look at vehicular stops. Try doing pedestrian stops later.
stop2 %>%
tabyl(reason) %>%
adorn_pct_formatting() %>%
gt()
| reason | n | percent | valid_percent |
|---|---|---|---|
| 10432 | 6.3% | 7.2% | |
| Citizen / 9-1-1 | 42854 | 26.0% | 29.6% |
| Equipment Violation | 20871 | 12.7% | 14.4% |
| Investigative | 28823 | 17.5% | 19.9% |
| Moving Violation | 41674 | 25.3% | 28.8% |
| NA | 19887 | 12.1% | - |
stop2 %>%
tabyl(problem) %>%
adorn_pct_formatting() %>%
gt()
| problem | n | percent |
|---|---|---|
| Attempt Pick-Up (P) | 3643 | 2.2% |
| Curfew Violations (P) | 112 | 0.1% |
| Suspicious Person (P) | 45784 | 27.8% |
| Suspicious Vehicle (P) | 36458 | 22.2% |
| Traffic Law Enforcement (P) | 78492 | 47.7% |
| Truancy (P) | 52 | 0.0% |
What’s the interaction between reasons and problems?
stop2 %>%
tabyl(reason, problem) %>%
gt()
| reason | Attempt Pick-Up (P) | Curfew Violations (P) | Suspicious Person (P) | Suspicious Vehicle (P) | Traffic Law Enforcement (P) | Truancy (P) |
|---|---|---|---|---|---|---|
| 220 | 6 | 1830 | 2004 | 6364 | 8 | |
| Citizen / 9-1-1 | 1196 | 16 | 22804 | 18590 | 237 | 11 |
| Equipment Violation | 2 | 1 | 332 | 866 | 19670 | 0 |
| Investigative | 1130 | 69 | 12723 | 8877 | 6001 | 23 |
| Moving Violation | 0 | 1 | 878 | 1912 | 38883 | 0 |
| NA | 1095 | 19 | 7217 | 4209 | 7337 | 10 |
Do we know why many stop reasons are NA? Race and Vehicle/Person search for these are NA as well, so drop from the data.
stop2 %>%
filter(is.na(reason)) %>%
tabyl(race)
## race n percent valid_percent
## <NA> 19887 1 NA
Does it make sense to include both Suspicious Vehicles and Traffic Law Enforcement as problems? Also filter out missing reasons.
How many stops now?
stop_v <- stop2 %>%
filter(problem %in% c("Suspicious Vehicle (P)", "Traffic Law Enforcement (P)"),
!is.na(reason))
n_distinct(stop_v$masterIncidentNumber) # only one incident per row, so we can do a row count
## [1] 103404
How many by year?
stop_v %>%
mutate(year = year(responseDate)) %>%
count(year) %>%
gt()
| year | n |
|---|---|
| 2017 | 32715 |
| 2018 | 29140 |
| 2019 | 24008 |
| 2020 | 17541 |
How many stops by race?
stop_v %>%
tabyl(race) %>%
adorn_pct_formatting() %>%
gt()
| race | n | percent |
|---|---|---|
| Asian | 1705 | 1.6% |
| Black | 37962 | 36.7% |
| East African | 6388 | 6.2% |
| Latino | 4712 | 4.6% |
| Native American | 1647 | 1.6% |
| Other | 2673 | 2.6% |
| Unknown | 21147 | 20.5% |
| White | 27170 | 26.3% |
Stops by race over the years:
stop_v %>%
mutate(year = year(responseDate)) %>%
count(year, race) %>%
ggplot(aes(year, n, color = race)) +
geom_line(size = 1) +
geom_point() +
gghighlight(race %in% c("White", "Black", "East African", "Unknown",
"Latino", NA)) +
theme_fivethirtyeight() +
labs(title = "Police stops over time by race")
A few observations:
Get a population as a baseline to compare stop rates. Are black/East African drivers stopped proportionally more than other drivers?
Get the population estimates from the census using tidycensus
# list of variables to get and their names
var <- c(
"Total" = "DP05_0033",
"White" = "DP05_0077", # using white non-hispanic
"Black" = "DP05_0038",
"Native American" = "DP05_0039",
"Asian" = "DP05_0044",
"Latino" = "DP05_0071", # hispanic/latinx of any race
"Other" = "DP05_0057" # some other race
)
# query census API with list of variables
mpls_race <- get_acs(
geography = "place", state = "MN",
year = 2019,
survey = "acs5", # get 5 year estimates - smaller margin of error and more appropriate for multi-year data
variables = var
) %>%
filter(str_detect(NAME, "Minneapolis")) %>% # get only Minneapolis from other MN places
select(-c(moe, GEOID, NAME))
# get percentages (out of total)
mpls_race %>%
mutate(total = estimate[variable == "Total"]) %>%
mutate(prop = estimate / total) %>%
select(-c(total)) %>%
gt() %>%
fmt_percent(
columns = vars(prop),
decimals = 1
)
| variable | estimate | prop |
|---|---|---|
| Total | 420324 | 100.0% |
| Black | 80664 | 19.2% |
| Native American | 5956 | 1.4% |
| Asian | 24961 | 5.9% |
| Other | 20915 | 5.0% |
| Latino | 40432 | 9.6% |
| White | 252180 | 60.0% |
To match census data, combine East African with black drivers.
stop_v2 <- stop_v %>%
mutate(race2 = if_else(race == "East African", "Black", race))
stop_v2 %>%
tabyl(race2) %>%
adorn_pct_formatting() %>%
gt()
| race2 | n | percent |
|---|---|---|
| Asian | 1705 | 1.6% |
| Black | 44350 | 42.9% |
| Latino | 4712 | 4.6% |
| Native American | 1647 | 1.6% |
| Other | 2673 | 2.6% |
| Unknown | 21147 | 20.5% |
| White | 27170 | 26.3% |
What is the rate of police stops considering the population of races in Mpls?
Not sure what to do about Unknown/NA - quite a few in this category
stop_v2 %>%
count(race2) %>%
left_join(mpls_race, by = c("race2" = "variable")) %>%
mutate(stop_rate = n / estimate) %>%
gt() %>%
fmt_number(
columns = vars(stop_rate),
decimals = 2
)
| race2 | n | estimate | stop_rate |
|---|---|---|---|
| Asian | 1705 | 24961 | 0.07 |
| Black | 44350 | 80664 | 0.55 |
| Latino | 4712 | 40432 | 0.12 |
| Native American | 1647 | 5956 | 0.28 |
| Other | 2673 | 20915 | 0.13 |
| Unknown | 21147 | NA | NA |
| White | 27170 | 252180 | 0.11 |
Black drivers are stopped more than 5 times compared to white drivers. Latinx slightly more. Native American drivers are stopped 1.5 times more than white drivers.
What proportion of drivers of each race are searched (vehicle or person)?
stop_v2 %>%
group_by(race2) %>%
summarize(
search_rate = mean(vehicleSearch == "YES", na.rm = TRUE),
frisk_rate = mean(personSearch == "YES", na.rm = TRUE)
) %>%
gt() %>%
fmt_number(
columns = vars(search_rate, frisk_rate),
decimals = 2
)
| race2 | search_rate | frisk_rate |
|---|---|---|
| Asian | 0.06 | 0.06 |
| Black | 0.16 | 0.17 |
| Latino | 0.09 | 0.09 |
| Native American | 0.25 | 0.31 |
| Other | 0.09 | 0.09 |
| Unknown | 0.04 | 0.02 |
| White | 0.05 | 0.06 |
Black drivers are 3 times more likely to have their vehicle searched or be frisked than white drivers. Native American drivers are 4.7 times more likely for vehicle searches and 5.8 times for being frisked.
Here the caveats from the Stanford Open Policing Project are important. These rates are not in themselves evidence of bias or discrimination because we don’t know the underlying rates of breaking traffic laws or committing crimes. Also it may reflect differences in transportation across races.
Here we could look at hit rate or the contraband recovery rate - how often searches were justified by finding contraband.
We don’t have that in this data set. We do have whether a citation was issued (as its own variable) and booking as part of the call disposition variable.
Should we look at vehicle and person search (frisking) separately?
stop_v2 %>%
tabyl(personSearch, vehicleSearch) %>%
gt()
| personSearch | NO | YES |
|---|---|---|
| NO | 90679 | 2157 |
| YES | 1924 | 8644 |
Actually, vehicle combined with person searches are more common than either search alone. To simplify, we can create a variable specifying whether any type of search is conducted.
stop_v3 <- stop_v2 %>%
mutate(any_search = if_else(personSearch == "YES" | vehicleSearch == "YES",
1, 0))
stop_v3 %>%
filter(any_search == 1) %>%
group_by(race2) %>%
summarize(
citation_rate = mean(citationIssued == "YES", na.rm = TRUE),
book_rate = mean(callDisposition == "BKG-Booking", na.rm = TRUE)
) %>%
gt() %>%
fmt_number(
columns = vars(citation_rate, book_rate),
decimals = 2
)
| race2 | citation_rate | book_rate |
|---|---|---|
| Asian | 0.13 | 0.49 |
| Black | 0.18 | 0.32 |
| Latino | 0.27 | 0.36 |
| Native American | 0.10 | 0.53 |
| Other | 0.15 | 0.38 |
| Unknown | 0.12 | 0.11 |
| White | 0.16 | 0.45 |
Here lower rates could signal discrimination. If black drivers are stopped but only cited or booked 5% of the time, and white drivers 50% of the time, “would lead us to believe that officers made sure they were certain white individuals had contraband before deciding to search, but that they were searching black individuals on a whiff of evidence” (Stanford Open Policing Project).
Here we see black drivers have a higher citation rate than white drivers - Native American drivers have a somewhat lower rate. However, 45% of searches end up with white drivers booked, but only 32% of black drivers, and 32% of Latinx drivers. Here Native American drivers have a higher book rate.
stop_v3 %>%
filter(any_search == 1) %>%
mutate(year = year(responseDate)) %>%
group_by(year, race2) %>%
summarize(
citation_rate = mean(citationIssued == "YES", na.rm = TRUE),
book_rate = mean(callDisposition == "BKG-Booking", na.rm = TRUE)
) %>%
pivot_longer(citation_rate:book_rate, names_to = "rate", values_to = "value") %>%
filter(race2 %in% c("White", "Black", "Native American", "Latino")) %>%
ggplot(aes(year, value, color = race2)) +
geom_line(size = 1) +
facet_wrap(~ rate) +
theme_fivethirtyeight() +
labs(title = "Rates of booking and citations by race over time",
color = "") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))
Does the citation or booking rate differ across police precincts? There are only 5 precincts, may not be useful to see. What about Minneapolis neighborhoods?
rates <- stop_v3 %>%
filter(any_search == 1) %>%
group_by(race2, neighborhood) %>%
summarize(
citation_rate = mean(citationIssued == "YES", na.rm = TRUE),
book_rate = mean(callDisposition == "BKG-Booking", na.rm = TRUE)
)
# What are the total stops by neighborhood?
stops_by_nh <- stop_v3 %>%
filter(any_search == 1) %>%
group_by(neighborhood) %>%
summarize(stops = n())
# make a tidy data frame with white rates, and minority rates per neighborhood
citation_rate <- rates %>%
ungroup() %>%
select(-book_rate) %>%
filter(race2 %in% c("White", "Black", "Latino"),
!is.na(neighborhood)) %>%
pivot_wider(names_from = "race2", values_from = "citation_rate",
values_fill = 0) %>%
left_join(stops_by_nh, by = "neighborhood") %>%
pivot_longer(c(Black:Latino), names_to = "minority_race",
values_to = "minority_citation_rate") %>%
rename(white_citation_rate = White)
book_rate <- rates %>%
ungroup() %>%
select(-citation_rate) %>%
filter(race2 %in% c("White", "Black", "Latino"),
!is.na(neighborhood)) %>%
pivot_wider(names_from = "race2", values_from = "book_rate",
values_fill = 0) %>%
left_join(stops_by_nh, by = "neighborhood") %>%
pivot_longer(c(Black:Latino), names_to = "minority_race",
values_to = "minority_book_rate") %>%
rename(white_book_rate = White)
citation_rate %>%
ggplot(aes(x = white_citation_rate, y = minority_citation_rate)) +
geom_point(aes(size = stops), pch = 21) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
facet_wrap(~ minority_race) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(labels = scales::percent_format()) +
labs(y = "Minority Citation Rate", x = "White Citation Rate",
title = "Comparing Citation Rates by Neighborhood")
book_rate %>%
ggplot(aes(x = white_book_rate, y = minority_book_rate)) +
geom_point(aes(size = stops), pch = 21) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
facet_wrap(~ minority_race) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(labels = scales::percent_format()) +
labs(y = "Minority Book Rate", x = "White Book Rate",
title = "Comparing Book Rates by Neighborhood")
For most neighborhoods, the citation rate is higher or the same for white drivers as it is for black or Latinx drivers.
For book rates though, the majority of neighborhoods, particularly those with the most stops, have lower rates for black/Latinx drivers.
This signals that officers are applying lower threshholds to minority drivers when it comes to offenses that are bookable. I think.