Author Topic: Computational redistricting  (Read 884 times)
jimrtex
« on: February 28, 2021, 08:58:15 PM »

I know that drawing maps yourself is where all the fun is, but I was interested in exploring redistricting metrics, optimization algorithms, and programming, so I started the following project.

The respective state is divided into a certain number of cells. The set of all cells in a state will be called A. Each cell x has a geographic location g(x) and a population p(x). Between two cells x and y you can calculate the Euclidean distance d(g(x),g(y)), for which I will also use the short notation d(x,y).
A Plan is a set of subsets D (called districts) of the set A that form a partition of A. The goal is then to maximize
∑_{D∈Plan} ∑_{x∈D, y∈D, x≠y} p(x)p(y) (f(d(x,y)) + h(x,y))
while keeping population sizes in the districts close to equal. Here f is a function that can transform the geographical distance and h is a function that works on non-geometric information.
In my experiments I used f=-log (the natural logarithm) and h(x,y)=log(2) if x and y are in the same county and 0 otherwise. A common choice for f is minus the square function, which results in k-means clustering, but here I wanted to try another function. Please feel free to suggest other choices for f and h if you want.
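
As a minimal sketch (illustrative cell data only, not the actual code used here), the score of a plan could be computed like this in Python:

Code:
import math

# Hypothetical cells: id -> (x, y, population, county). Coordinates are in some
# arbitrary planar unit; the values below are purely illustrative.
cells = {
    "A1": (0.10, 0.20, 1200, "Adams"),
    "A2": (0.12, 0.22,  900, "Adams"),
    "B1": (0.80, 0.75, 1500, "Berks"),
}

def d(u, v):
    """Euclidean distance between the centers of cells u and v."""
    (x1, y1, _, _), (x2, y2, _, _) = cells[u], cells[v]
    return math.hypot(x1 - x2, y1 - y2)

def f(dist):
    return -math.log(dist)                      # f = -log (natural logarithm)

def h(u, v):
    return math.log(2) if cells[u][3] == cells[v][3] else 0.0   # same-county bonus

def score(plan):
    """plan: list of districts, each a list of cell ids (a partition of the cells)."""
    total = 0.0
    for district in plan:
        for i, u in enumerate(district):
            for v in district[i + 1:]:
                pu, pv = cells[u][2], cells[v][2]
                # each unordered pair occurs twice in the double sum over x != y
                total += 2 * pu * pv * (f(d(u, v)) + h(u, v))
    return total

print(score([["A1", "A2"], ["B1"]]))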

I tried this on Pennsylvania. I drew a plan of ca. 500 cells in DRA (this is the sad part of the exercise and I would be happy to find a better method). Then I exported the map information into a .geojson file. I loaded the .geojson file into a Python dictionary and wrote a function to calculate the geographical centers of the polygons (sometimes multi-polygons) that describe the cell geographies. I also wrote a dictionary containing the counties of the cells (I drew the cells so that every cell is contained in one county). The population numbers in the .geojson file were from 2010 I guess, but that's OK for a proof of concept.
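
A rough sketch of that loading step (not the exact code; the population property name "Total" is a guess and may differ in the DRA export):

Code:
import json

with open("pa_cells.geojson") as fh:            # hypothetical file name
    gj = json.load(fh)

def ring_centroid(ring):
    """Crude centroid: average of the vertices of a ring.
    (An area-weighted centroid, or shapely, would be more accurate.)"""
    xs = [pt[0] for pt in ring]
    ys = [pt[1] for pt in ring]
    return sum(xs) / len(xs), sum(ys) / len(ys)

centers, populations = {}, {}
for i, feat in enumerate(gj["features"]):
    geom = feat["geometry"]
    if geom["type"] == "Polygon":
        rings = [geom["coordinates"][0]]                 # outer ring
    else:                                                # MultiPolygon
        rings = [poly[0] for poly in geom["coordinates"]]
    pts = [ring_centroid(r) for r in rings]              # average the parts
    centers[i] = (sum(p[0] for p in pts) / len(pts),
                  sum(p[1] for p in pts) / len(pts))
    populations[i] = feat["properties"].get("Total", 0)  # guessed property name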

Then I played around with a relatively naive optimization algorithm. It's not perfect, but it does create pretty neat results. Some examples:






If you have questions, suggestions or ideas, please post.

I added horizontal divisors to make the three maps more apparent.

On my Iowa maps, I used Census Tracts and the 2015-2019 5-year estimates, which would be a little more current. It would be interesting to see how much difference there was between two maps that used the 2010 Census data and newer data (with the same number of congressional districts - let's not try to change the apportionment at the same time). Lat/long for the centroids of the census tracts are available in the Gazetteer.

There should be some demographic coherence for census tracts because they are relatively small and at least when drawn were intended to have statistical meaning.

Since I was using lat/long for my cut lines, I didn't need to project the coordinates, though I did for my displays. If I were calculating a compactness metric, I would have.

Since census tracts are defined within counties, there might be a slight tendency to pull the centers of tracts away from county lines, perhaps enhancing clusters within counties. For Pennsylvania, I would guess around 4000 census tracts - I don't know what effect that would have on your optimizations.

p(x)p(y) (f(d(x,y)) + h(x,y))

I think this is saying that you are multiplying the population of cell x by the population of cell y by (f(d(x,y)) + h(x,y)).

But doesn't d(x,y) have units? So how can you take the natural log, and how do you balance with h(x,y) which is unitless?

Am I correct to understand that when you are maximizing the sum you are actually finding the least negative value?

p(x)p(y) is just weighting by all persons in a cell correct? It is as if all persons in a cell lived in a carbon fiber cell tower.

What if all cells had the same population? Then that term can drop out. Would that make optimization more efficient? Let's take a state like Pennsylvania, apportion 5000 districts among the counties, and then draw these small districts in each county.

When those maps with a dot for each 5000 persons of a given ethnicity are drawn, how is the position of the dots determined?

The Florida legislature had a website where you submit a plan. Then it would compute a metric of average travel time or distance between any two persons in the district. It also computed the same metric for black persons. The two could be compared to determine whether you were drawing two isolated black populations into the same district, or whether they were relatively compact. I don't know whether travel distance can be efficiently computed during optimization - they were only computing it for final submitted plans, and it was not being displayed as you drew the map.

In Michigan, it appears that using Euclidean distance is tilting the upper portions of the lower peninsula towards the Lake Michigan side, even though people would not ordinarily travel by boat (though in the distant past, the two northern districts were divided as the Lake Michigan shores of the UP and LP, and the Lakes Superior and Huron shores of the UP and LP, respectively).

Can you draw actual maps, showing the boundaries of the districts?

You could also give a bonus for being in the same UCC or other similar region. This would work better in Michigan.

Michigan Prosperity Regions (PDF)

Michigan Associations of Regions

Can your algorithm draw 50 senate districts or 203 house districts in Pennsylvania?

If you could get a large sample of persons to draw their ideal congressional districts, then you could use that in your distance function. "I regard these people as being district neighbors". Computer-assisted, crowd-sourced, purple flying gerrymanders.

jimrtex
« Reply #1 on: February 28, 2021, 09:10:29 PM »

Some background on why I posted more than one map and on optimization algorithms in general.

The set of possible plans forms a very high-dimensional space, and the score function
s(Plan) = ∑_{D∈Plan} ∑_{x∈D, y∈D, x≠y} p(x)p(y) (f(d(x,y)) + h(x,y))
is a very complicated function on that space. The problem of finding a global optimum is very difficult, but it is possible to find local optima. The possibility of getting several distinct plans (and their related scores) when running the (probabilistic) algorithm several times can, in my opinion, also be seen as a feature. I didn't post the scores for the PA plans, but maybe I will run the algorithm again in the future and post new plans and their scores.

Anyways, I went on to the next state, Michigan. Sadly I haven't incorporated the VRA into my algorithm so far, but data on race is provided in the .geojson file and I have some ideas on how to trigger the algorithm to create VRA districts.

Additionally, the population data seems to be from 2010 again, which makes the plans less applicable for 2020 redistricting. I still hope that some trends can be seen and that at some point 2020 census data will be available for my 500-cell plans.

Generally the algorithm seems to generate much more stable results for Michigan than for Pennsylvania. Scores from different runs of the algorithm are not that radically different. Also the algorithm tended to provide some unconventional (though probably not optimal) ideas for Pennsylvania, e.g. drawing one seat in Central/West Allegheny and one in East Allegheny and Westmoreland, or having no proper Montgomery seat, or drawing a seat from Erie to Beaver. Things like that seem to happen less often in Michigan.

Best run (probably still not optimal), score: 1390.1


Second best run, score: 1386.6


Most plans (also the ones I don't show here) are very similar outside Wayne/Oakland/Macomb. But one plan splits the Tri-cities and puts Saginaw with Flint (score: 1382.1)


If anyone wants to see what the algorithm says on a specific state, please let me know.

How many runs did you do in Michigan?

Is there any sense of how many runs would be needed to find the optimum plan?

What does the worst-scoring map look like, and what does the distribution of scores look like?

If you displayed scores vs. run number and computed the slope (least squares) would it converge towards zero?

Can the best-scoring plans be considered outliers?

Does a score that is 90% of the best score have any sensible meaning? Can that be characterized as a "good try", or is it just a "lower score"?

jimrtex
« Reply #2 on: March 03, 2021, 06:15:25 AM »


I added horizontal divisors to make the three maps more apparent.

On my Iowa maps, I used Census Tracts and the 2015-2019 5-year estimates, which would be a little more current. It would be interesting to see how much difference there was between two maps that used the 2010 Census data and newer data (with the same number of congressional districts - let's not try to change the apportionment at the same time). Lat/long for the centroids of the census tracts are available in the Gazetteer.

There should be some demographic coherence for census tracts because they are relatively small and at least when drawn were intended to have statistical meaning.

Since I was using lat/long for my cut lines, I didn't need to project the coordinates, though I did for my displays. If I were calculating a compactness metric, I would have.

Since census tracts are defined within counties, there might be a slight tendency to pull the centers of tracts away from county lines, perhaps enhancing clusters within counties. For Pennsylvania, I would guess around 4000 census tracts - I don't know what effect that would have on your optimizations.

Census tracts are a good idea. Do you have a link to them and to the most recent population estimates? Edit: Never mind, found it on the census homepage. I would very much like to work with data that is as up-to-date as possible.

The gazetteer has data about census tracts, particularly lat/long.

https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.2019.html

I used the 2015-2019 5 year ACS total population for estimates.

Be sure to select 2019 first. The Census Bureau treats different vintages of census tracts as different geographical data. If you select the 2020 census tracts, it will find no data for them.

It turns out that there are 3218 Census Tracts in Pennsylvania.
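
Reading the tract centroids out of the Gazetteer file is only a few lines; a sketch with pandas (the file name and the GEOID/INTPTLAT/INTPTLONG column names follow the published layout, but should be checked against the actual download):

Code:
import pandas as pd

gaz = pd.read_csv("2019_Gaz_tracts_42.txt", sep="\t")   # hypothetical PA file name
gaz.columns = [c.strip() for c in gaz.columns]          # last header is often padded

centroids = {
    str(row.GEOID).zfill(11): (row.INTPTLAT, row.INTPTLONG)
    for row in gaz.itertuples()
}
print(len(centroids))    # should be on the order of 3218 tracts for PA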


Sadly my algorithm is a bit slow. I could try to optimize the calculations a bit, but 4000 cells is probably at the higher end of what is doable, if not beyond it.

Quote
p(x)p(y) (f(d(x,y)) + h(x,y))

I think this is saying that you are multiplying the population of cell x by the population of cell y by (f(d(x,y)) + h(x,y)).

But doesn't d(x,y) have units? So how can you take the natural log, and how do you balance with h(x,y) which is unitless?

Good question. As you correctly observe, d(x,y) has units. So before applying f, we have to drop the unit. The trick is that changing the unit multiplies every distance by the same factor c, which only adds the constant log(c) to every log(d(x,y)) (since log(c·d) = log(c) + log(d)). Hence every plan's score is shifted by essentially the same constant (the shift is log(c) times the summed pair weights p(x)p(y), which is nearly identical for every plan once district populations are close to equal), which leaves the plans comparable with each other, because the differences in score between two plans remain essentially the same when changing the unit.

Maybe a way to phrase the formula that generalizes better is
f(d(x,y) · k(x,y)),
taking k(x,y) to be 0.5 if x and y are in the same county and 1 otherwise. This means that cells in the same county are treated as if their distance were only half the actual distance.

Makes sense. I was thinking that distance had to be normalized somehow (e.g. divided by square root of area of state).

In more rural areas the districts will tend to be more compact, since there are relatively few intra-county neighbors, and even when neighbors are in different cells the distance to them will be short. But in denser areas, the in-county bonus will tend to overwhelm distance. Thus in Pennsylvania you tend to end up with a Montco district, a Bucks district, and two Philadelphia districts; the Delaware district tends to bleed into Philadelphia because it doesn't have enough population. You also get two Allegheny County districts.

The area west of Philadelphia (York, Lancaster, Harrisburg, Reading) is somewhat unstable because the counties are large compared to a congressional district, but not quite large enough to have their own districts.

It would be interesting to experiment with the in-county bonus. I assume each run begins with random districts. If you could capture those, then you could find local optima using different-sized county bonuses.


Quote
Am I correct to understand that when you are maximizing the sum you are actually finding the least negative value?

That depends on the units. -log(d(x,y)) will take positive values for d(x,y) small enough and negative values for larger d(x,y). I took the units such that the biggest distance between two cells in my maps was smaller than 1 unit, which results in -log(d(x,y)) always being positive. If you took the units such that the smallest distance between two cells was bigger than 1 unit, then -log(d(x,y)) is always negative and you are correct. But as I said changing the unit only shifts the scores for all plans by a constant.

Quote
p(x)p(y) is just weighting by all persons in a cell correct? It is as if all persons in a cell lived in a carbon fiber cell tower.

Exactly.

Quote
What if all cells had the same population? Then that term can drop out. Would that make optimization more efficient? Let's take a state like Pennsylvania, apportion 5000 districts among the counties, and then draw these small districts in each county.

Good observation. Originally that was my plan, but then I drew a cell plan of PA whose number of cells wasn't divisible by the number of districts I wanted, which frustrated me. So I went for something more flexible.

I think that it could make the optimization more efficient. As far as I know, the main problem is not multiplying by the population numbers. That's easy. The main problem is that our optimization problem is finding a maximum subject to equality constraints, namely population equality. If all cells have the same population, then by switching two cells between districts you can go from one plan that fulfills the constraints to another. If on the other hand the cells are not equal in population, the algorithm has a permanent tendency to deviate from population equality towards a state where more densely populated districts are too big and more sparsely populated districts are too small. In order to counteract that I had to add some sort of dynamically estimated Lagrange multipliers.
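
Not necessarily the exact scheme used here, but a minimal sketch of how a move step can trade the pairwise score off against a population-balance penalty with an adaptive multiplier (all names are illustrative; score_gain stands for the incremental score change computed elsewhere):

Code:
import random

def try_move(assign, pops, cell_pop, ideal, score_gain, lam):
    """assign: cell -> district, pops: district -> total population,
    cell_pop: cell -> population, lam: current multiplier for the population term."""
    cell = random.choice(list(assign))
    src = assign[cell]
    dst = random.choice([d for d in pops if d != src])
    p = cell_pop[cell]
    # change in squared population deviation caused by the move
    before = (pops[src] - ideal) ** 2 + (pops[dst] - ideal) ** 2
    after = (pops[src] - p - ideal) ** 2 + (pops[dst] + p - ideal) ** 2
    if score_gain(cell, dst) - lam * (after - before) > 0:
        assign[cell] = dst
        pops[src] -= p
        pops[dst] += p
        return True
    return False

def update_multiplier(lam, pops, ideal, tol=0.01, factor=1.1):
    """Nudge lam up when the plan drifts too far from equality, down otherwise."""
    worst = max(abs(p - ideal) / ideal for p in pops.values())
    return lam * factor if worst > tol else lam / factor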

Do you recalculate the Euclidean distance repeatedly, or do you only adjust your district scores for some combinations? Since the score for any pair of cells doesn't change - it is only which pairs are summed that changes - this might make districts based on travel distance feasible.


Quote
When those maps with a dot for each 5000 persons of a given ethnicity are drawn, how is the position of the dots determined?

I don't know yet, but this could be helpful.

I've always envisioned this as being done by a method that creates equal population districts, and then the dot rather than the boundary represents the district. In effect, the legend for your map might read "each dot represents 25,000 district residents." But maybe they just place dots in areas of high density and clear out the 5,000 nearest residents, before placing the next dot. This might misplace some of the later dots, but will still be representative.

Quote
The Florida legislature had a website where you submit a plan. Then it would compute a metric of average travel time or distance between any two persons in the district. It also computed the same metric for black persons. The two could be compared to determine whether you were drawing two isolated black populations into the same district, or whether they were relatively compact. I don't know whether travel distance can be efficiently computed during optimization - they were only computing it for final submitted plans, and it was not being displayed as you drew the map.

Travel distance would be good and probably more real-world oriented, but I think that "bird-flight" distance has the advantages of easy calculation and of being "a hard geographical fact".

Quote
In Michigan, it appears that using Euclidean distance is tilting the upper portions of the lower peninsula towards the Lake Michigan side, even though people would not ordinarily travel by boat (though in the distant past, the two northern districts were divided as the Lake Michigan shores of the UP and LP, and the Lakes Superior and Huron shores of the UP and LP, respectively).

Maybe slightly. Although I think that the effect you observed can be largely attributed to gravitational effects of the Grand Traverse mini-metro.

Perhaps in Michigan, you could concentrate all the UP population at the north end of the bridge. When doing the legislative districts for the UP, place a sufficient population at the south end of the bridge to make a whole number of districts.

In general for legislative districts where some counties will have more than one district you can apportion the districts to those counties, decomposing the map into smaller problems. I was trying to think how in Pennsylvania you could split the state into smaller areas. If you were to freeze the eastern part of the state, the districts adjacent to the frozen line would conform to the line. The effect would gradually diminish, just as a change in the Philadelphia area is unlikely to have any effect on how a Pittsburgh district is drawn.

In Iowa, the cells could be the counties, since we know that whole-county congressional maps with high levels of equality can be drawn. Outsiders have always speculated that the maps were computer-drawn, but they are actually done by hand. It is possible to draw reasonably compact, highly equal plans by hand. Muon2 has drawn a map that is even more equal, but not compact - but since they were working with the goal of compactness they didn't find it. Perhaps for Iowa the degree of equality could be used in the score - though there is the risk that a negligible improvement in equality could override everything else. I think the Census Bureau has centers of population for counties, which would be the preferred locations for county-based cells. You could use 2019 estimates or project forward to the 2020 census date. If you permitted a modest range in equality (say +/- 2%), you could compare any benefit in compactness vs. equality.


Quote
Can you draw actual maps, showing the boundaries of the districts?
I will definitely do that for "definitive" maps.
Quote
You could also give a bonus for being in the same UCC or other similar region. This would work better in Michigan.

Michigan Prosperity Regions (PDF)

Michigan Associations of Regions

I will do a variant that includes that. It's easy to include that into the algorithm.

I would refrain from including that into my vanilla method, though. As you will have observed, my algorithm heavily rewards putting population hubs that are close together into the same district. So I feel that the algorithm can sort that out by itself and that there is no need for telling it what to do. I agree with you that the general idea of keeping metro areas together is very important and that was one of the motivations that brought me to the project: Finding a method that would keep metro areas together without explicitly being told so. Also if I look at the maps in the posted links, I think that a lot of the rough setup is as similar to my maps as it can get.

Quote
Can your algorithm draw 50 senate districts or 203 house districts in Pennsylvania?

In principle, why not. In practice, my current cell plan is probably a bit too coarse for that, particularly for a house plan. Additionally I would have to evaluate the effect that an increased number of districts has on the optimization algorithm. I fear that it could make convergence slower.

Quote
If you could get a large sample of persons to draw their ideal congressional districts, then you could use that in your distance function. "I regard these people as being district neighbors". Computer-assisted, crowd-sourced, purple flying gerrymanders.

Oh yes, that would be superb. It probably goes beyond the scope of this project, though.


Just ten. As I said, results and scores for Michigan are more stable than for Pennsylvania (at least outside Wayne/Oakland) for whatever reason. I could shorten the runs. I could also try to speed them up somehow.

Quote
Is there any sense of how many runs would be needed to find the optimum plan?
Many, I think.

An additional problem (or feature if you want) is that the dynamically estimated Lagrange multipliers don't let the algorithm converge to just one local optimum. Instead it continues to move a district here and a district there and creates new, slightly different maps. I then take the best-scored map that occurred during each run.

Quote
What does the worst-scoring map look like, and what does the distribution of scores look like?

I will show that. Or maybe I'll clean up my code and make it available to you. It's just a Jupyter notebook and it reads in two simple .txt files.

Quote
If you displayed scores vs. run number and computed the slope (least squares) would it converge towards zero?

I would guess so.

Quote
Can the best-scoring plans be considered outliers?

I don't think so. The best-scoring plans look very similar to the other good plans. It's just that most plans get stuck with a slightly sub-optimal layout here and there. The fewer of these sub-optimal decisions are made, the better the plan. But best-scoring plans are very similar to good plans, just slightly better. ;-)

Quote
Does a score that is 90% of the best score have any sensible meaning? Can that be characterized as a "good try", or is it just a "lower score"?

No, I would rather look at absolute differences than at quotients. As I pointed out, changing the scaling moves all scores by a constant, so the score itself is pretty arbitrary. It would be good if we could generally calculate a score that says 2000 = optimal, 1800 = excellent, 1600 = good, 1400 = average, 1200 = bad, 1000 = #!&?*!, but I don't think that this is realistic. What is realistic in my opinion is to do a number of runs, look at the empirical distribution and hold any plan to that standard.
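
As a toy illustration of holding a plan to the empirical distribution of run scores (the numbers are invented):

Code:
import statistics

run_scores = [1372.5, 1388.0, 1381.3, 1379.9, 1385.6,
              1377.2, 1383.8, 1389.4, 1375.0, 1384.2]

def judge(score, scores):
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    rank = sum(s <= score for s in scores) / len(scores)
    return (f"score {score:.1f}: at or above {rank:.0%} of runs, "
            f"{(score - mean) / sd:+.2f} sd from the mean")

print(judge(1385.6, run_scores))
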

jimrtex
« Reply #3 on: March 04, 2021, 09:03:55 PM »

Do you recalculate the Euclidean distance repeatedly, or do you only adjust your district scores for some combinations? Since the score for any pair of cells doesn't change - it is only which pairs are summed that changes - this might make districts based on travel distance feasible.
I calculate a matrix containing the p(x)p(y) f(d(x,y)· k(x,y)) for all pairs of cells in the beginning (for PA census tracts ca. 10 million entries), then calculate the score for the start plan, and then I update the score at every iteration by only adding and subtracting the updates for the switched cell. The big matrix remains unchanged all the time.
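
A sketch of that kind of incremental update (dummy matrix; in the real thing W would hold the precomputed p(x)p(y) f(d(x,y)·k(x,y)) values):

Code:
import numpy as np

N = 500
rng = np.random.default_rng(0)
W = rng.normal(size=(N, N))
W = W + W.T                      # symmetric, like the pairwise score terms
np.fill_diagonal(W, 0.0)         # no self-pairs

def score_delta(W, members, x, src, dst):
    """Change in the total score if cell x moves from district src to dst.
    members: district id -> set of cell indices (x currently belongs to src)."""
    lost = sum(W[x, y] for y in members[src] if y != x)
    gained = sum(W[x, y] for y in members[dst])
    return 2 * (gained - lost)   # factor 2: the score sums over ordered pairs

members = {0: set(range(0, 250)), 1: set(range(250, 500))}
print(score_delta(W, members, x=10, src=0, dst=1))
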
Since F(i,j) = F(j,i) and F(i,i) = 0, you only need to compute and store half as many values. For 0 ≤ i < j ≤ N-1, the index is

k(i,j) = (N - (i+1)/2)·i + (j - i - 1)

k(0,1) = 0
k(3216,3217) = 5,176,152

With larger memories - this might not matter.
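
A quick sanity check of that packed index in code (integer form of the same formula):

Code:
def packed_index(i, j, N):
    """Index into a 1-D array storing F(i, j) for 0 <= i < j <= N-1,
    packed row by row; equals (N - (i+1)/2)*i + (j - i - 1)."""
    return i * N - i * (i + 1) // 2 + (j - i - 1)

N = 3218                                # Pennsylvania census tracts
print(packed_index(0, 1, N))            # 0
print(packed_index(3216, 3217, N))      # 5176152
print(N * (N - 1) // 2 - 1)             # last slot = number of pairs - 1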

When the PL 94-171 data comes out, you might want to switch to VTDs. In 2010 there were 9256 VTDs. This will provide better conformance with local political subdivisions, which was the key finding of the Pennsylvania redistricting cases. Perhaps VTDs could be merged into a smaller number of cells. But that may be what you did in your original effort.
Quote
Quote
In Iowa, the cells could be the counties, since we know that whole county congressional maps of high levels of equality can be drawn. Outsiders have always speculated that the maps were computer-drawn, but they are actually done by hand. It is possible to draw reasonably compact, highly equal plans by hand. Muon2 has drawn a map that is even more equal, but not compact - but since they were working with the goal of compactness they didn't find it. Perhaps for Iowa the degree of equality could be used in the score - though there is the risk that negligible improvement in equality could override everything else. I think the Census Bureau has center of population for counties, which would be the preferred location for county-based cells. You could use 2019 estimates or project forward to the 2020 census date. If you permitted a modest range in equality (say +/- 2%, you could compare any benefit in compactness vs. equality).

Iowa-style redistricting is certainly an interesting challenge. The problem with my algorithm in this regard is that it doesn't really incorporate population inequality into the score. Instead it forbids switching cells from underpopulated districts and switching cells to overpopulated districts. This leads to the resulting plans being a bit too coarse regarding population equality.
Let's make a distinction between the plan score and your algorithm. I might be able to develop a higher scoring plan on my kitchen table than your algorithm.

Let's define a tolerable plan as one in which all districts are within 5%. Then we can define an equality score as the standard deviation (relative to the ideal population). A plan that is significantly more equal (say 0.2%) defeats another plan regardless of compactness. Two plans that are about as equal (say within 0.2%) can be compared based on compactness.
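
A sketch of that comparison rule (the thresholds and the compactness numbers are just placeholders; a higher compactness score is assumed to be better):

Code:
import statistics

def equality_score(district_pops, ideal):
    """Standard deviation of district populations, relative to the ideal population."""
    return statistics.pstdev(district_pops) / ideal

def is_tolerable(district_pops, ideal, tol=0.05):
    return all(abs(p - ideal) / ideal <= tol for p in district_pops)

def better_plan(plan_a, plan_b, ideal, equality_margin=0.002):
    """plan_x: (district_pops, compactness_score). A clearly more equal plan wins;
    otherwise the plans are compared on compactness."""
    eq_a = equality_score(plan_a[0], ideal)
    eq_b = equality_score(plan_b[0], ideal)
    if abs(eq_a - eq_b) > equality_margin:
        return "A" if eq_a < eq_b else "B"
    return "A" if plan_a[1] >= plan_b[1] else "B"

ideal = 760_000   # hypothetical ideal district population
print(better_plan(([759000, 761500, 760200], 10.5),
                  ([755000, 765000, 760000], 11.0), ideal))   # "A": clearly more equal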


jimrtex
« Reply #4 on: March 06, 2021, 09:20:51 AM »

Oh yes, I insist upon the distinction between score/metric and algorithm all the time, so you're right to call me out on that. Which opens up the next question: calculating the score for a plan that is already in the format of my Python program is fast, but how do I bring the plans from the kitchen table into that format? I don't want to rewrite DRA. ;-)
How difficult would it be to export plans from your program or to import them into it? I'm thinking of a text file with GeoID/district number pairs.

If you could export them, it is trivial to bring them into QGIS and produce all kinds of maps. If you could import them, then I could send you a plan, and you could score it or possibly use it as a starting point for refinement.


jimrtex
« Reply #5 on: March 06, 2021, 09:52:14 PM »

Oh yes, I insist upon the distinction between score/metric and algorithm all the time, so you're right to call me out on that. Which opens up the next question: calculating the score for a plan that is already in the format of my Python program is fast, but how do I bring the plans from the kitchen table into that format? I don't want to rewrite DRA. ;-)
How difficult would it be to export plans from your program or to import them into it? I'm thinking of a text file with GeoID/district number pairs.

If you could export them, it is trivial to bring them into QGIS and produce all kinds of maps. If you could import them, then I could send you a plan, and you could score it or possibly use it as a starting point for refinement.

You mean a text file containing something like the following text?

1400000US42001030101, 9
1400000US42001030102, 13
1400000US42001030200, 2
...

(The number behind the comma being the district the census tract is assigned to.)
Both importing and exporting them would be easy.
Using QGIS is probably a good idea.
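
A minimal sketch of that import/export (file paths hypothetical):

Code:
def read_assignment(path):
    """Return {geoid: district} from lines like '1400000US42001030101, 9'."""
    assignment = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            geoid, district = (part.strip() for part in line.split(","))
            assignment[geoid] = int(district)
    return assignment

def write_assignment(path, assignment):
    with open(path, "w") as fh:
        for geoid, district in sorted(assignment.items()):
            fh.write(f"{geoid},{district}\n")    # no space after the comma
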
I was thinking of the part behind the US, but either would work. Shapefiles from the Census Bureau include both.

The 140 identifies these as Census Tracts. I think the 0000 may be a reserved field, and the US may identify the country (I think there is a program to coordinate North American GIS codes). The Tract ID is formed by three sub-fields.

42 001 030101

42 = Pennsylvania
001 = Adams County
030101 = CT 301.01

In maps and printed material leading zeroes are suppressed. The last two digits are a subfield, displayed as a suffix, that usually indicates a derived CT. CT 301.01 and CT 301.02 resulted from a division of CT 301. If the last two digits are 00, they are also suppressed in maps and printed material.

What I ordinarily do is to download the shapefiles from the Census Bureau. I can drag the .shp file into the canvas of QGIS, and open the .dbf file in Excel. I might strip out some of the extraneous fields. I save this as an .xlsx file. Excel really doesn't like writing .dbf files, and I also want to preserve source files.

If I am redistricting, I add a field for district number/name, and then do a sub-table for accumulating the population of the districts, and simply assign tracts/counties, etc. by filling in the district number/name.

Then I create a .csv file from selected fields, which might be as simple as the Geoid, and the district number. I load that into QGIS, join it to my shapefile, and then create maps.

I do my editing in Excel, so am repeatedly making updates to my .xlsx file, copying to the .csv file, saving that. I then remove the .csv file in QGIS, and redo the add of the .csv file and the join.

I would eliminate the space after the comma. When you write a .csv file with Excel it does not use a space, and does not quote text unless there is a comma in the text. So a city with multiple words would appear as

123456,San Francisco,856472

Also, add a header row such as

AFFGEOID,STATE,COUNTY,TRACT,DISTRICT

Having the state ID, county ID, and tract may help locate the tract when visually looking at the .csv (though .csv files are primarily for computer interchange - and all computers comprehend ASCII).

When I was doing the Iowa maps, I output all the districts in a single file:

e.g.  RUN1,RUN2,RUN3, etc.

I would probably do the same with your output - it would be easier than working with multiple files.
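
A sketch of writing every run into one .csv with a header row, one column per run, ready for a QGIS join (names illustrative):

Code:
import csv

def write_runs(path, runs):
    """runs: run name -> {geoid: district}."""
    run_names = sorted(runs)
    geoids = sorted(set().union(*(set(r) for r in runs.values())))
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["GEOID"] + run_names)
        for geoid in geoids:
            writer.writerow([geoid] + [runs[name].get(geoid, "") for name in run_names])

write_runs("pa_runs.csv", {
    "RUN1": {"42001030101": 9, "42001030102": 13},
    "RUN2": {"42001030101": 9, "42001030102": 12},
})
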

jimrtex
« Reply #6 on: March 07, 2021, 12:02:15 PM »

You may be interested in this.

Migration and commuting interactions fields: a new geography with community detection algorithm?

One item I found particularly interesting was the measure of stability at paragraph 21, which indicates the percentage of runs in which a node was in the same area.

Perhaps you could keep a count of times each pair of nodes ended up in the same district. If this is a high percentage, the nodes could be considered close friends.

Nodes with a very high number of close friends could form the core of districts.
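
A sketch of keeping that count across runs (toy data):

Code:
from collections import Counter
from itertools import combinations

def co_assignment_counts(runs):
    """runs: list of assignments, each mapping cell id -> district.
    Returns how often each pair of cells ended up in the same district."""
    counts = Counter()
    for assign in runs:
        by_district = {}
        for cell, district in assign.items():
            by_district.setdefault(district, []).append(cell)
        for members in by_district.values():
            for a, b in combinations(sorted(members), 2):
                counts[(a, b)] += 1
    return counts

counts = co_assignment_counts([
    {"A": 1, "B": 1, "C": 2},
    {"A": 1, "B": 2, "C": 2},
])
print(counts[("A", "B")])   # 1: A and B shared a district in one of the two runs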


jimrtex
« Reply #7 on: March 10, 2021, 08:48:49 AM »

I thought of another approach.

Imagine Pennsylvania had retained 18 districts and you simply wanted to equalize population. You could take your algorithm and award a bonus for census tracts that are currently in the same district.

You could begin with random districts, but the algorithm would tend to return towards the current districts. There would be tension between maintaining the current districts and making them more compact while equalizing population. Generally, this would minimize the number of persons assigned to new districts.

Now repeat with 17 districts. For a few districts, the bonus will not be sufficient to keep them intact, but it may retain larger segments.
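
This would fit naturally into the h(x,y) term; a sketch with an extra bonus for pairs that share a current district (the bonus sizes are arbitrary):

Code:
import math

def h(x, y, county, current_district,
      same_county_bonus=math.log(2), same_district_bonus=math.log(1.5)):
    """county and current_district map cell id -> county / current district."""
    bonus = 0.0
    if county[x] == county[y]:
        bonus += same_county_bonus
    if current_district[x] == current_district[y]:
        bonus += same_district_bonus
    return bonus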
