Computational redistricting
       |           

Welcome, Guest. Please login or register.
Did you miss your activation email?
April 30, 2024, 10:53:21 AM
News: Election Simulator 2.0 Released. Senate/Gubernatorial maps, proportional electoral votes, and more - Read more

  Talk Elections
  General Politics
  Political Geography & Demographics (Moderators: muon2, 100% pro-life no matter what)
  Computational redistricting
« previous next »
Pages: [1]
Author Topic: Computational redistricting  (Read 900 times)
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« on: February 27, 2021, 05:54:06 PM »
« edited: March 01, 2021, 09:35:47 AM by palandio »

I know that drawing maps by yourself is all the fun, but I was interested in exploring redistricting metrics, optimization algorithms and programming, so I started the following project.

The respective state is divided into a certain number of cells. The set of all cells in a state will be called A. Each cell x has a geographic location g(x) and a population p(x). Between two cells x and y you can calculate the Euclidean distance d(g(x),g(y)), for which I will also use the short notation d(x,y).
A Plan is a set of subsets D (called districts) of the set A that form a partition of A. The goal is then to maximize
D∈Planx∈D,y∈D,x≠y p(x)p(y) (f(d(x,y)) + h(x,y))
keeping population sizes in the districts close to equal. Here f is a function that can transform geographical distance and h is a function that works on non-geometric information.
In my experiments I used f=-log (the natural logarithm) and h(x,y)=log(2) if x and y are in the same county and 0 otherwise. A common choice for f is minus the square function, which results in k-means clustering, but here I wanted to try another function. Please feel free to suggest other choices for f and h if you want.

I tried this on Pennsylvania. I draw a plan of ca. 500 cells in DRA (this is the sad part of the exercise and I would be happy to find a better method). Then I exported the map information into a .geojson file. I loaded the .geojson file into a python dictionary and wrote a function to calculate the geographical centers of the polygons (sometimes multi-polygons) that describe the cell geographies. I also wrote a dictionary containing the counties of the cells (I drew the cells so that every cell is contained in one county). The population numbers in the .geojson file were from 2010 I guess, but that's ok for a proof of concept.

Then I played around with a relatively naive optimization algorithm. It's not perfect, but it does create pretty neat results. Some examples:






If you have questions, suggestions or ideas, please post.

Edit: Inserted lines between maps like proposed by jimrtex
Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #1 on: February 28, 2021, 06:07:34 PM »

Some background on why I posted more than one map and on optimization algorithms in general.

The set of possible plans form a very high-dimensional space and the score function
s(Plan) = D∈Planx∈D,y∈D,x≠y p(x)p(y) (f(d(x,y)) + h(x,y))
is a very complicated function on that space. The problem to find a global optimum is very difficult, but it is possible to find local optima. The possibility to get several distinct plans (and their related scores) when running the (probabilistic) algorithm several times can in my opinion also be seen as a feature. I didn't post the scores for the PA plans, but maybe I will run the algorithm again in the future and post new plans and their scores.

Anyways, I went on to the next state, Michigan. Sadly I haven't incorporated the VRA into my algorithm so far, but data on race is provided in the .geojson file and I have some ideas on how to trigger the algorithm to create VRA districts.

Additionally the population data seems to be from 2010 again, which makes the plans less applicable for 2020 redistricting. I still hope that some trends can be seen and that at some point 2020 census data is available for my 500 cell plans.

Generally the algorithm seems to generate much more stable results for Michigan than for Pennsylvania. Scores from different runs of the algorithm are not that radically different. Also the algorithm tended to provide some unconventional (though probably not optimal) ideas for Pennsylvania, e.g. drawing one seat in Central/West Allegheny and one in East Allegheny and Westmoreland, or having no proper Montgomery seat, or drawing a seat from Erie to Beaver. Things like that seem to happen less often in Michigan.

Best run (probably still not optimal), score: 1390.1


Second best run, score: 1386.6


Most plans (also the ones I don't show here) are very similar outside Wayne/Oakland/Macomb. But one plan splits the Tri-cities and puts Saginaw with Flint (score: 1382.1)


If anyone wants to see what the algorithm says on a specific state, please let me know.
Logged
President Punxsutawney Phil
TimTurner
Atlas Politician
Atlas Legend
*****
Posts: 41,392
United States


Show only this user's posts in this thread
« Reply #2 on: February 28, 2021, 06:48:38 PM »

Texas?
Logged
Born to Slay. Forced to Work.
leecannon
Junior Chimp
*****
Posts: 6,957
United States


Political Matrix
E: -6.45, S: -6.78

Show only this user's posts in this thread
« Reply #3 on: February 28, 2021, 08:39:05 PM »

If you could overlay a state map/make an approximate of them in Dave’s that’d be really cool and helpful in visualizing them, especially when there’s two with similar colors next to each other.

I’d also be interested to see South Carolina and Maryland
Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #4 on: February 28, 2021, 08:58:15 PM »

I know that drawing maps by yourself is all the fun, but I was interested in exploring redistricting metrics, optimization algorithms and programming, so I started the following project.

The respective state is divided into a certain number of cells. The set of all cells in a state will be called A. Each cell x has a geographic location g(x) and a population p(x). Between two cells x and y you can calculate the Euclidean distance d(g(x),g(y)), for which I will also use the short notation d(x,y).
A Plan is a set of subsets D (called districts) of the set A that form a partition of A. The goal is then to maximize
D∈Planx∈D,y∈D,x≠y p(x)p(y) (f(d(x,y)) + h(x,y))
keeping population sizes in the districts close to equal. Here f is a function that can transform geographical distance and h is a function that works on non-geometric information.
In my experiments I used f=-log (the natural logarithm) and h(x,y)=log(2) if x and y are in the same county and 0 otherwise. A common choice for f is minus the square function, which results in k-means clustering, but here I wanted to try another function. Please feel free to suggest other choices for f and h if you want.

I tried this on Pennsylvania. I draw a plan of ca. 500 cells in DRA (this is the sad part of the exercise and I would be happy to find a better method). Then I exported the map information into a .geojson file. I loaded the .geojson file into a python dictionary and wrote a function to calculate the geographical centers of the polygons (sometimes multi-polygons) that describe the cell geographies. I also wrote a dictionary containing the counties of the cells (I drew the cells so that every cell is contained in one county). The population numbers in the .geojson file were from 2010 I guess, but that's ok for a proof of concept.

Then I played around with a relatively naive optimization algorithm. It's not perfect, but it does create pretty neat results. Some examples:






If you have questions, suggestions or ideas, please post.

I added horizontal divisors to make the three maps more apparent.

On my Iowa maps, I used Census Tracts, and the 2015-2019 5 year estimates which would be a little more current. It would be interesting to see how much difference there was for too maps that used the 2010 Census Data and newer data (with the same number of congressional districts - let's not try to change the apportionment at the same time). Lat/Long for the centroids of the census tracts are available in the Gazetteer.

There should be some demographic coherence for census tracts because they are relatively small and at least when drawn were intended to have statistical meaning.

Since I was using lat/long for my cut lines, I didn't need to project the coordinates, though I did for my displays. If I was calculating a compactness metric I would have.

Since census tracts are defined within counties, there might be a slight tendency to pull the centers of tracts away from county lines, perhaps enhancing clusters within counties. For Pennsylvania, I would guess around 4000 census tracts - I don't know what effect that would have on your optimizations.

p(x)p(y) (f(d(x,y)) + h(x,y))

I think this is saying that you are multiplying the population of cell x by the population of cell y by (f(d(x,y)) + h(x,y)).

But doesn't d(x,y) have units? So how can you take the natural log, and how do you balance with h(x,y) which is unitless?

Am I correct to understand that when you are maximizing the sum you are actually finding the least negative value?

p(x)p(y) is just weighting by all persons in a cell correct? It is as if all persons in a cell lived in a carbon fiber cell tower.

What if all cells had the same population? Then that term can drop out. Would that make optimization more efficient. Let's take a state like Pennsylvania and apportion 5000 districts among the counties and then draw these small districts in each county.

When those maps with a dot for each 5000 persons of a given ethnicity are drawn, how is the position of the dots determined?

The Florida legislature had a website where you submit a plan. Then it would compute a metric of average travel time or distance between any two persons in the district. It also computed the same metric for black persons. The two could be compared to determine whether you were drawing two isolated black populations into the same district, or whether they were relatively compact. I don't know whether travel distance can be efficiently computed during optimization - they were only computing it for final submitted plans, and it was not being displayed as you drew the map.

In Michigan, it appears that using Euclidean distance is tilting the upper portions of the lower peninsula towards the Lake Michigan side, even though people would not ordinarily travel by boat (though in the distance past, the two northern districts were divided as Lake Michigan shore of UP and LP, and Lakes Superior and Huron shores of UP and LP respectively.

Can you draw actual maps, showing the boundaries of the districts?

You could also give a bonus for being in the same UCC or other similar region. This would work better in Michigan.

Michigan Prosperity Regions (PDF)

Michigan Associations of Regions

Can your algorithm draw 50 senate districts or 203 house districts in Pennsylvania?

If you could get a large sample of persons to draw their ideal congressional districts, then you could use that in your distance function. "I regard these people as being district neighbors". Computer-assisted, crowd-sourced, purple flying gerrymanders.
Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #5 on: February 28, 2021, 09:10:29 PM »

Some background on why I posted more than one map and on optimization algorithms in general.

The set of possible plans form a very high-dimensional space and the score function
s(Plan) = D∈Planx∈D,y∈D,x≠y p(x)p(y) (f(d(x,y)) + h(x,y))
is a very complicated function on that space. The problem to find a global optimum is very difficult, but it is possible to find local optima. The possibility to get several distinct plans (and their related scores) when running the (probabilistic) algorithm several times can in my opinion also be seen as a feature. I didn't post the scores for the PA plans, but maybe I will run the algorithm again in the future and post new plans and their scores.

Anyways, I went on to the next state, Michigan. Sadly I haven't incorporated the VRA into my algorithm so far, but data on race is provided in the .geojson file and I have some ideas on how to trigger the algorithm to create VRA districts.

Additionally the population data seems to be from 2010 again, which makes the plans less applicable for 2020 redistricting. I still hope that some trends can be seen and that at some point 2020 census data is available for my 500 cell plans.

Generally the algorithm seems to generate much more stable results for Michigan than for Pennsylvania. Scores from different runs of the algorithm are not that radically different. Also the algorithm tended to provide some unconventional (though probably not optimal) ideas for Pennsylvania, e.g. drawing one seat in Central/West Allegheny and one in East Allegheny and Westmoreland, or having no proper Montgomery seat, or drawing a seat from Erie to Beaver. Things like that seem to happen less often in Michigan.

Best run (probably still not optimal), score: 1390.1


Second best run, score: 1386.6


Most plans (also the ones I don't show here) are very similar outside Wayne/Oakland/Macomb. But one plan splits the Tri-cities and puts Saginaw with Flint (score: 1382.1)


If anyone wants to see what the algorithm says on a specific state, please let me know.
How many runs did you do in Michigan?

Is there any sense of how many runs would be needed to find the optimum plan?

What does the worst-scoring map look like, and what does the distribution of scores look like?

If you displayed scores vs. run number and computed the slope (least squares) would it converge towards zero?

Can the best-scoring plans be considered outliers?

Does a score that is 90% of the best score have any sensible meaning? Can that be characterized as "good try", or just it is a "lower score."
Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #6 on: March 01, 2021, 02:11:59 AM »

Ooooh, I feared that request. :-D
I will do that, but try to solve the VRA issue before.

If you could overlay a state map/make an approximate of them in Dave’s that’d be really cool and helpful in visualizing them, especially when there’s two with similar colors next to each other.

I’d also be interested to see South Carolina and Maryland
The approximate in DRA is definitely doable. I just didn't deem it necessary for a quick proof of concept, but I understand that visually it will add a lot, so I will do that in the future.

South Carolina and Maryland are definitely good ideas.
Logged
President Punxsutawney Phil
TimTurner
Atlas Politician
Atlas Legend
*****
Posts: 41,392
United States


Show only this user's posts in this thread
« Reply #7 on: March 01, 2021, 03:28:13 AM »

Ooooh, I feared that request. :-D
I will do that, but try to solve the VRA issue before.
Take all the time you need. It's a task as big as Texas. Tongue
Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #8 on: March 01, 2021, 04:05:42 AM »
« Edited: March 01, 2021, 07:31:27 AM by palandio »


I added horizontal divisors to make the three maps more apparent.

On my Iowa maps, I used Census Tracts, and the 2015-2019 5 year estimates which would be a little more current. It would be interesting to see how much difference there was for too maps that used the 2010 Census Data and newer data (with the same number of congressional districts - let's not try to change the apportionment at the same time). Lat/Long for the centroids of the census tracts are available in the Gazetteer.

There should be some demographic coherence for census tracts because they are relatively small and at least when drawn were intended to have statistical meaning.

Since I was using lat/long for my cut lines, I didn't need to project the coordinates, though I did for my displays. If I was calculating a compactness metric I would have.

Since census tracts are defined within counties, there might be a slight tendency to pull the centers of tracts away from county lines, perhaps enhancing clusters within counties. For Pennsylvania, I would guess around 4000 census tracts - I don't know what effect that would have on your optimizations.

Census tracts are a good idea. Do you have a link to them and to the most recent population estimates? Edit: Never mind, found it on the census homepage. I would very much like to work with data that is as up-to-date  as possible.

Sadly my algorithm is a bit slow. I could try to optimize the calculations a bit, but 4000 cells is probably at the higher end of what is doable, if not beyond it.

Quote
p(x)p(y) (f(d(x,y)) + h(x,y))

I think this is saying that you are multiplying the population of cell x by the population of cell y by (f(d(x,y)) + h(x,y)).

But doesn't d(x,y) have units? So how can you take the natural log, and how do you balance with h(x,y) which is unitless?

Good question. As you correctly observe, d(x,y) has units. So before applying f, we have to scrap the unit. The trick is now that changing the units only adds a constant to log(d(x,y)), the same constant for every pair of cells. Hence every plan's score will be shifted by the same constant, which leaves the plans comparable between each other, because the differences in score between two plans remain the same when changing the unit.

Maybe a way to phrase the formula that is better generalizable is
(f(d(x,y) k(x,y)),
taking k(x,y) to be 0.5 if x and y are in the same county and 1 otherwise. This means that cells in the same county are treated as if their distance was only half the distance.

Quote
Am I correct to understand that when you are maximizing the sum you are actually finding the least negative value?

That depends on the units. -log(d(x,y)) will take positive values for d(x,y) small enough and negative values for larger d(x,y). I took the units such that the biggest distance between two cells in my maps was smaller than 1 unit, which results in -log(d(x,y)) always being positive. If you took the units such that the smallest distance between two cells was bigger than 1 unit, then -log(d(x,y)) is always negative and you are correct. But as I said changing the unit only shifts the scores for all plans by a constant.

Quote
p(x)p(y) is just weighting by all persons in a cell correct? It is as if all persons in a cell lived in a carbon fiber cell tower.

Exactly.

Quote
What if all cells had the same population? Then that term can drop out. Would that make optimization more efficient. Let's take a state like Pennsylvania and apportion 5000 districts among the counties and then draw these small districts in each county.

Good observation. Originally that was my plan, but then I drew a cell plan of PA whose number of cells wasn't divisible by the number of districts I wanted, which frustrated me. So I went for something more flexible

I think that it could make the optimization more efficient. As far as I know the main problem is not multiplying by the population numbers. That's easy. The main problem is that our optimization problem is finding a maximum subject to equality constraints, namely population equality. If all cells have the same population, then by switching two cells between districts you can go from one plan that fulfills the constraints to another. If on the other hand the cells are not equal in population, the algorithm has a permanent tendency to deviate from population equality towards a state where more densely populated districts are to big and more sparsely populated districts are to small. In order to counteract that I had to add some sort of dynamically estimated Lagrange multipliers.

Quote
When those maps with a dot for each 5000 persons of a given ethnicity are drawn, how is the position of the dots determined?

I don't know yet, but this could be helpful.

Quote
The Florida legislature had a website where you submit a plan. Then it would compute a metric of average travel time or distance between any two persons in the district. It also computed the same metric for black persons. The two could be compared to determine whether you were drawing two isolated black populations into the same district, or whether they were relatively compact. I don't know whether travel distance can be efficiently computed during optimization - they were only computing it for final submitted plans, and it was not being displayed as you drew the map.

Travel distance would be good and probably more real-world oriented, but I think that "bird-flight" distance has the advantages of easy calculation and of being "a hard geographical fact".

Quote
In Michigan, it appears that using Euclidean distance is tilting the upper portions of the lower peninsula towards the Lake Michigan side, even though people would not ordinarily travel by boat (though in the distance past, the two northern districts were divided as Lake Michigan shore of UP and LP, and Lakes Superior and Huron shores of UP and LP respectively.

Maybe slightly. Although I think that the effect you observed can be largely attributed to gravitational effects of the Grand Traverse mini-metro.

Quote
Can you draw actual maps, showing the boundaries of the districts?
I will definitely do that for "definitive" maps.
Quote
You could also give a bonus for being in the same UCC or other similar region. This would work better in Michigan.

Michigan Prosperity Regions (PDF)

Michigan Associations of Regions

I will do a variant that includes that. It's easy to include that into the algorithm.

I would refrain from including that into my vanilla method, though. As you will have observed, my algorithm heavily rewards putting population hubs that are close together into the same district. So I feel that the algorithm can sort that out by itself and that there is no need for telling it what to do. I agree with you that the general idea of keeping metro areas together is very important and that was one of the motivations that brought me to the project: Finding a method that would keep metro areas together without explicitly being told so. Also if I look at the maps in the posted links, I think that a lot of the rough setup is as similar to my maps as it can get.

Quote
Can your algorithm draw 50 senate districts or 203 house districts in Pennsylvania?

In principle, why not. In practice, my current cell plan is probably a bit too coarse for that, particularly for a house plan. Additionally I would have to evaluate the effect that an increased number of district has on the optimization algorithm. I fear that it could make convergence slower.

Quote
If you could get a large sample of persons to draw their ideal congressional districts, then you could use that in your distance function. "I regard these people as being district neighbors". Computer-assisted, crowd-sourced, purple flying gerrymanders.

Oh yes, that would be superb. It probably goes beyond the scope of this project, though.


Just ten. As I said results and scores for Michigan are more stable than for Pennsylvania (at least outside Wayne/Oakland) for whatever reason. I could shorten the runs. I could also try to speed up them somehow.

Quote
Is there any sense of how many runs would be needed to find the optimum plan?
Many, I think.

An additional problem (or feature if you want) is that the dynamically estimated Lagrange multipliers don't let the algorithm converge to just one local optimum. Instead it continues to move a district here and a district there and create new slightly different maps. I then take the best-scored map that occurred during each run.

Quote
What does the worst-scoring map look like, and what does the distribution of scores look like?

I will show. Or maybe I clean up my code and make it available to you. It's just a jupyter notebook and it reads in two simple .txt files.

Quote
If you displayed scores vs. run number and computed the slope (least squares) would it converge towards zero?

I would guess so.

Quote
Can the best-scoring plans be considered outliers?

I don't think so. The best-scoring plans look very similar to the other good plans. It's just that most plans get stuck with a slightly below optimal layout here and there. The less of these sub-optimal decisions are taken, the better plan. But best-scoring plans are very similar to good plans, just slightly better. ;-)

Quote
Does a score that is 90% of the best score have any sensible meaning? Can that be characterized as "good try", or just it is a "lower score."

No, I would rather look at absolute differences than at quotients. As I pointed out, changing the scaling moves all scores by a constant, so the score itself is pretty arbitrary. It would be good if we could generally calculate a score that says 2000 = optimal, 1800 = excellent, 1600 = good, 1400 = average, 1200 = bad, 1000 = #!&?*!, but I don't think that this is realistic. What is realistic in my opinion is to do a number of runs, look at the empirical distribution and hold any plan to that standard.
Logged
President Punxsutawney Phil
TimTurner
Atlas Politician
Atlas Legend
*****
Posts: 41,392
United States


Show only this user's posts in this thread
« Reply #9 on: March 01, 2021, 04:35:11 AM »

Smaller mini-request: Kansas
Logged
Vern
vern1988
Sr. Member
****
Posts: 2,203
United States


Political Matrix
E: -1.30, S: -0.70

P P P

Show only this user's posts in this thread
« Reply #10 on: March 01, 2021, 10:12:59 PM »

North Carolina please!
Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #11 on: March 03, 2021, 06:15:25 AM »


I added horizontal divisors to make the three maps more apparent.

On my Iowa maps, I used Census Tracts, and the 2015-2019 5 year estimates which would be a little more current. It would be interesting to see how much difference there was for too maps that used the 2010 Census Data and newer data (with the same number of congressional districts - let's not try to change the apportionment at the same time). Lat/Long for the centroids of the census tracts are available in the Gazetteer.

There should be some demographic coherence for census tracts because they are relatively small and at least when drawn were intended to have statistical meaning.

Since I was using lat/long for my cut lines, I didn't need to project the coordinates, though I did for my displays. If I was calculating a compactness metric I would have.

Since census tracts are defined within counties, there might be a slight tendency to pull the centers of tracts away from county lines, perhaps enhancing clusters within counties. For Pennsylvania, I would guess around 4000 census tracts - I don't know what effect that would have on your optimizations.

Census tracts are a good idea. Do you have a link to them and to the most recent population estimates? Edit: Never mind, found it on the census homepage. I would very much like to work with data that is as up-to-date  as possible.

The gazetteer has data about census tracts, particularly lat/long.

https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.2019.html

I used the 2015-2019 5 year ACS total population for estimates.

Be sure to select 2019 first. The Census Bureau treats different vintages of census tracts as different geographical data. If you select the 2020 census tracts, it will find no data for them.

It turns out that there are 3218 Census Tracts in Pennsylvania.


Sadly my algorithm is a bit slow. I could try to optimize the calculations a bit, but 4000 cells is probably at the higher end of what is doable, if not beyond it.

Quote
p(x)p(y) (f(d(x,y)) + h(x,y))

I think this is saying that you are multiplying the population of cell x by the population of cell y by (f(d(x,y)) + h(x,y)).

But doesn't d(x,y) have units? So how can you take the natural log, and how do you balance with h(x,y) which is unitless?

Good question. As you correctly observe, d(x,y) has units. So before applying f, we have to scrap the unit. The trick is now that changing the units only adds a constant to log(d(x,y)), the same constant for every pair of cells. Hence every plan's score will be shifted by the same constant, which leaves the plans comparable between each other, because the differences in score between two plans remain the same when changing the unit.

Maybe a way to phrase the formula that is better generalizable is
(f(d(x,y) k(x,y)),
taking k(x,y) to be 0.5 if x and y are in the same county and 1 otherwise. This means that cells in the same county are treated as if their distance was only half the distance.

Makes sense. I was thinking that distance had to be normalized somehow (e.g. divided by square root of area of state).

In more rural area the districts will tend to be more compact since there are relatively few intra-county neighbors, and even when in different cells the distance to the neighbors will be short. But in denser areas, the in-county bonus will tend to overwhelm distance. Thus in Pennsylvania you tend to end up with a Montco district, a Bucks district, and two Philadelphia districts, the Delaware district tends to bleed into Philadelphia because it doesn't have enough population. You also get two Allegheny County districts.

The area west of Philadelphia (York, Lancaster, Harrisburg, Reading) is somewhat unstable because the counties are large comparable to a congressional district, but not quite large enough to have their own.

It would be interesting to experiment with the in-county bonus. I assume each run begins with random districts. If you could capture those, then you could find local minimums using different size county bonuses,


Quote
Am I correct to understand that when you are maximizing the sum you are actually finding the least negative value?

That depends on the units. -log(d(x,y)) will take positive values for d(x,y) small enough and negative values for larger d(x,y). I took the units such that the biggest distance between two cells in my maps was smaller than 1 unit, which results in -log(d(x,y)) always being positive. If you took the units such that the smallest distance between two cells was bigger than 1 unit, then -log(d(x,y)) is always negative and you are correct. But as I said changing the unit only shifts the scores for all plans by a constant.

Quote
p(x)p(y) is just weighting by all persons in a cell correct? It is as if all persons in a cell lived in a carbon fiber cell tower.

Exactly.

Quote
What if all cells had the same population? Then that term can drop out. Would that make optimization more efficient. Let's take a state like Pennsylvania and apportion 5000 districts among the counties and then draw these small districts in each county.

Good observation. Originally that was my plan, but then I drew a cell plan of PA whose number of cells wasn't divisible by the number of districts I wanted, which frustrated me. So I went for something more flexible

I think that it could make the optimization more efficient. As far as I know the main problem is not multiplying by the population numbers. That's easy. The main problem is that our optimization problem is finding a maximum subject to equality constraints, namely population equality. If all cells have the same population, then by switching two cells between districts you can go from one plan that fulfills the constraints to another. If on the other hand the cells are not equal in population, the algorithm has a permanent tendency to deviate from population equality towards a state where more densely populated districts are to big and more sparsely populated districts are to small. In order to counteract that I had to add some sort of dynamically estimated Lagrange multipliers.

Do you recalculate the Euclidean distance repeatedly, or do only to adjust your district scores for some combinations? Since the score for any pair of cells doesn't change - it is only which pairs are summed that changes - this might make districts based on travel distance feasible.


Quote
When those maps with a dot for each 5000 persons of a given ethnicity are drawn, how is the position of the dots determined?

I don't know yet, but this could be helpful.

I've always envisioned this as being done by a method that creates equal population districts, and then the dot rather than the boundary represents the district. In effect, the legend for your map might read "each dot represents 25,000 district residents." But maybe they just place dots in areas of high density and clear out the 5,000 nearest residents, before placing the next dot. This might misplace some of the later dots, but will still be representative.

Quote
The Florida legislature had a website where you submit a plan. Then it would compute a metric of average travel time or distance between any two persons in the district. It also computed the same metric for black persons. The two could be compared to determine whether you were drawing two isolated black populations into the same district, or whether they were relatively compact. I don't know whether travel distance can be efficiently computed during optimization - they were only computing it for final submitted plans, and it was not being displayed as you drew the map.

Travel distance would be good and probably more real-world oriented, but I think that "bird-flight" distance has the advantages of easy calculation and of being "a hard geographical fact".

Quote
In Michigan, it appears that using Euclidean distance is tilting the upper portions of the lower peninsula towards the Lake Michigan side, even though people would not ordinarily travel by boat (though in the distance past, the two northern districts were divided as Lake Michigan shore of UP and LP, and Lakes Superior and Huron shores of UP and LP respectively.

Maybe slightly. Although I think that the effect you observed can be largely attributed to gravitational effects of the Grand Traverse mini-metro.

Perhaps in Michigan, you could concentrate all the UP population at the north end of the bridge. When doing the legislative districts for the UP, place a sufficient population at the south end of the bridge to make a whole number of districts.

In general for legislative districts where some counties will have more than one district you can apportion the districts to those counties, decomposing the map into smaller problems. I was trying to think how in Pennsylvania you could split the state into smaller areas. If you were to freeze the eastern part of the state, the districts adjacent to the frozen line would conform to the line. The effect would gradually diminish, just as a change in the Philadelphia area is unlikely to have any effect on how a Pittsburgh district is drawn.

In Iowa, the cells could be the counties, since we know that whole county congressional maps of high levels of equality can be drawn. Outsiders have always speculated that the maps were computer-drawn, but they are actually done by hand. It is possible to draw reasonably compact, highly equal plans by hand. Muon2 has drawn a map that is even more equal, but not compact - but since they were working with the goal of compactness they didn't find it. Perhaps for Iowa the degree of equality could be used in the score - though there is the risk that negligible improvement in equality could override everything else. I think the Census Bureau has center of population for counties, which would be the preferred location for county-based cells. You could use 2019 estimates or project forward to the 2020 census date. If you permitted a modest range in equality (say +/- 2%, you could compare any benefit in compactness vs. equality).


Quote
Can you draw actual maps, showing the boundaries of the districts?
I will definitely do that for "definitive" maps.
Quote
You could also give a bonus for being in the same UCC or other similar region. This would work better in Michigan.

Michigan Prosperity Regions (PDF)

Michigan Associations of Regions

I will do a variant that includes that. It's easy to include that into the algorithm.

I would refrain from including that into my vanilla method, though. As you will have observed, my algorithm heavily rewards putting population hubs that are close together into the same district. So I feel that the algorithm can sort that out by itself and that there is no need for telling it what to do. I agree with you that the general idea of keeping metro areas together is very important and that was one of the motivations that brought me to the project: Finding a method that would keep metro areas together without explicitly being told so. Also if I look at the maps in the posted links, I think that a lot of the rough setup is as similar to my maps as it can get.

Quote
Can your algorithm draw 50 senate districts or 203 house districts in Pennsylvania?

In principle, why not. In practice, my current cell plan is probably a bit too coarse for that, particularly for a house plan. Additionally I would have to evaluate the effect that an increased number of district has on the optimization algorithm. I fear that it could make convergence slower.

Quote
If you could get a large sample of persons to draw their ideal congressional districts, then you could use that in your distance function. "I regard these people as being district neighbors". Computer-assisted, crowd-sourced, purple flying gerrymanders.

Oh yes, that would be superb. It probably goes beyond the scope of this project, though.


Just ten. As I said results and scores for Michigan are more stable than for Pennsylvania (at least outside Wayne/Oakland) for whatever reason. I could shorten the runs. I could also try to speed up them somehow.

Quote
Is there any sense of how many runs would be needed to find the optimum plan?
Many, I think.

An additional problem (or feature if you want) is that the dynamically estimated Lagrange multipliers don't let the algorithm converge to just one local optimum. Instead it continues to move a district here and a district there and create new slightly different maps. I then take the best-scored map that occurred during each run.

Quote
What does the worst-scoring map look like, and what does the distribution of scores look like?

I will show. Or maybe I clean up my code and make it available to you. It's just a jupyter notebook and it reads in two simple .txt files.

Quote
If you displayed scores vs. run number and computed the slope (least squares) would it converge towards zero?

I would guess so.

Quote
Can the best-scoring plans be considered outliers?

I don't think so. The best-scoring plans look very similar to the other good plans. It's just that most plans get stuck with a slightly below optimal layout here and there. The less of these sub-optimal decisions are taken, the better plan. But best-scoring plans are very similar to good plans, just slightly better. ;-)

Quote
Does a score that is 90% of the best score have any sensible meaning? Can that be characterized as "good try", or just it is a "lower score."

No, I would rather look at absolute differences than at quotients. As I pointed out, changing the scaling moves all scores by a constant, so the score itself is pretty arbitrary. It would be good if we could generally calculate a score that says 2000 = optimal, 1800 = excellent, 1600 = good, 1400 = average, 1200 = bad, 1000 = #!&?*!, but I don't think that this is realistic. What is realistic in my opinion is to do a number of runs, look at the empirical distribution and hold any plan to that standard.
Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #12 on: March 03, 2021, 06:05:51 PM »

The gazetteer has data about census tracts, particularly lat/long.

https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.2019.html

I used the 2015-2019 5 year ACS total population for estimates.

Be sure to select 2019 first. The Census Bureau treats different vintages of census tracts as different geographical data. If you select the 2020 census tracts, it will find no data for them.

It turns out that there are 3218 Census Tracts in Pennsylvania.

I'm going to use census tracts with lat/long centroid data from 2010 and population estimates from the 2015-2019 five year ACS. If I understand correctly, censustract geography changes only once a decade, so that should be ok.

Quote
Makes sense. I was thinking that distance had to be normalized somehow (e.g. divided by square root of area of state).

I was asking myself if distances had to be normalized and found out that choosing f in the formula D∈Planx∈D,y∈D,x≠y p(x)p(y) f(d(x,y)· k(x,y)) to be either the logarithm or -xα or x with α>0 means that changing the scale of the distance means just a linear transformation of scores.

Quote
In more rural area the districts will tend to be more compact since there are relatively few intra-county neighbors, and even when in different cells the distance to the neighbors will be short. But in denser areas, the in-county bonus will tend to overwhelm distance. Thus in Pennsylvania you tend to end up with a Montco district, a Bucks district, and two Philadelphia districts, the Delaware district tends to bleed into Philadelphia because it doesn't have enough population. You also get two Allegheny County districts.

The area west of Philadelphia (York, Lancaster, Harrisburg, Reading) is somewhat unstable because the counties are large comparable to a congressional district, but not quite large enough to have their own.

Yes, that's what I saw in Pennsylvania, too. You could of course argue that you don't need an algorithm to tell you that there is basically one way to draw county-preserving districts in SEPA proper, but that it's more complicated in the area west of Philadelphia. I still think that it's nice get a numerical quantification for it.

Quote
It would be interesting to experiment with the in-county bonus. I assume each run begins with random districts. If you could capture those, then you could find local minimums using different size county bonuses,

Yes, I will definitely experiment with the in-county bonus. No bonus is of course an obvious option and might lead to interesting results e.g. in SEPA. Half-distance like now. Or a weaker bonus. Or a stronger bonus. I also feel that I might have to introduce a city bonus, some proposed plans chop into Pittsburgh.

Each run begins with random districts. But even after that for each iteration the algorithm chooses the cell to reassign and the target district at random.

I fear that using different county bonuses would not lead to plans that are local minimums with other county bonuses.

What I could try is to already start with a certain plan a see if it remains roughly stable or if it drifts into a completely different plan. For example I could try if for Michigan there exists a stable plan with a Flint+Saginaw+Bay City+Midland district by simply providing such a plan as the starting point. (I think that the algorithm will hate any plan with a major Macomb chop as long as there is an in-county bonus.)
Quote
Do you recalculate the Euclidean distance repeatedly, or do only to adjust your district scores for some combinations? Since the score for any pair of cells doesn't change - it is only which pairs are summed that changes - this might make districts based on travel distance feasible.
I calculate a matrix containing the p(x)p(y) f(d(x,y)· k(x,y)) for all pairs of cells in the beginning (for PA census tracts ca. 10 million entries), then calculate the score for the start plan, and then I update the score at every iteration by only adding and subtracting the updates for the switched cell. The big matrix remains unchanged all the time.

Therefore as you say, using any type of distance you want (e.g. travel distance) is feasible. I think that it's beyond the scope of this private project, though.

Quote
In general for legislative districts where some counties will have more than one district you can apportion the districts to those counties, decomposing the map into smaller problems. I was trying to think how in Pennsylvania you could split the state into smaller areas. If you were to freeze the eastern part of the state, the districts adjacent to the frozen line would conform to the line. The effect would gradually diminish, just as a change in the Philadelphia area is unlikely to have any effect on how a Pittsburgh district is drawn.

I just improved my implementation so that the algorithm can now cope with all Pennsylvania census tracts. Decomposing the map is certainly helpful, but I hope to achieve similar effects in a less intrusive way.

Quote
In Iowa, the cells could be the counties, since we know that whole county congressional maps of high levels of equality can be drawn. Outsiders have always speculated that the maps were computer-drawn, but they are actually done by hand. It is possible to draw reasonably compact, highly equal plans by hand. Muon2 has drawn a map that is even more equal, but not compact - but since they were working with the goal of compactness they didn't find it. Perhaps for Iowa the degree of equality could be used in the score - though there is the risk that negligible improvement in equality could override everything else. I think the Census Bureau has center of population for counties, which would be the preferred location for county-based cells. You could use 2019 estimates or project forward to the 2020 census date. If you permitted a modest range in equality (say +/- 2%, you could compare any benefit in compactness vs. equality).

Iowa-style redistricting is certainly an interesting challenge. The problem with my algorithm in this regard is that it doesn't really incorporate population inequality into the score. Instead it forbids switching cells from underpopulated districts and switching cells to overpopulated districts. This leads to the resulting plans being a bit too coarse regarding population equality.

Anyway, I successfully ran the algorithm with PA census tracts, so any future maps will not be inhibited by me having to draw the annoying 500 cell plans in DRA. It's still not at fast as I would like it to be, and computing time is of course growing with the number of cells, but I think that we're on the right way to efficiently produce plans for the requested states and more.
Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #13 on: March 04, 2021, 09:03:55 PM »

Do you recalculate the Euclidean distance repeatedly, or do only to adjust your district scores for some combinations? Since the score for any pair of cells doesn't change - it is only which pairs are summed that changes - this might make districts based on travel distance feasible.
I calculate a matrix containing the p(x)p(y) f(d(x,y)· k(x,y)) for all pairs of cells in the beginning (for PA census tracts ca. 10 million entries), then calculate the score for the start plan, and then I update the score at every iteration by only adding and subtracting the updates for the switched cell. The big matrix remains unchanged all the time.
Since F(i,j) = F(j,i), and F(i,i) = 0, you only need to compute and store half as many values. For (0 ≤ i < j ≤ N-1) the

index k(i,j) = (N - (i+1)/2) ) i + (j-i-1)

k(0,1) = 0
k(3216,3217) = 5,176,152

With larger memories - this might not matter.

When the PL 94-171 data comes out, you might want to switch to VTD. In 2010 there were 9256 VTD. This will provide better conformance with local political subdivisions which was the key finding of the Pennsylvania redistricting cases. Perhaps VTD could be merged into smaller cells. But that may be what you did in your original effort.
Quote
Quote
In Iowa, the cells could be the counties, since we know that whole county congressional maps of high levels of equality can be drawn. Outsiders have always speculated that the maps were computer-drawn, but they are actually done by hand. It is possible to draw reasonably compact, highly equal plans by hand. Muon2 has drawn a map that is even more equal, but not compact - but since they were working with the goal of compactness they didn't find it. Perhaps for Iowa the degree of equality could be used in the score - though there is the risk that negligible improvement in equality could override everything else. I think the Census Bureau has center of population for counties, which would be the preferred location for county-based cells. You could use 2019 estimates or project forward to the 2020 census date. If you permitted a modest range in equality (say +/- 2%, you could compare any benefit in compactness vs. equality).

Iowa-style redistricting is certainly an interesting challenge. The problem with my algorithm in this regard is that it doesn't really incorporate population inequality into the score. Instead it forbids switching cells from underpopulated districts and switching cells to overpopulated districts. This leads to the resulting plans being a bit too coarse regarding population equality.
Let's make a distinction between the plan score and your algorithm. I might be able to develop a higher scoring plan on my kitchen table than your algorithm.

Let's define a tolerable plan as one in which all districts are within 5%. Then we can define an equality score as the standard deviation (relative to the ideal population). A plan that is significantly more equal (say 0.2%) defeats another plan regardless of compactness. Two plans that are about as equal (say within 0.2%) can be compared based on compactness.

Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #14 on: March 05, 2021, 03:38:30 PM »

Do you recalculate the Euclidean distance repeatedly, or do only to adjust your district scores for some combinations? Since the score for any pair of cells doesn't change - it is only which pairs are summed that changes - this might make districts based on travel distance feasible.
I calculate a matrix containing the p(x)p(y) f(d(x,y)· k(x,y)) for all pairs of cells in the beginning (for PA census tracts ca. 10 million entries), then calculate the score for the start plan, and then I update the score at every iteration by only adding and subtracting the updates for the switched cell. The big matrix remains unchanged all the time.
Since F(i,j) = F(j,i), and F(i,i) = 0, you only need to compute and store half as many values. For (0 ≤ i < j ≤ N-1) the

index k(i,j) = (N - (i+1)/2) ) i + (j-i-1)

k(0,1) = 0
k(3216,3217) = 5,176,152

With larger memories - this might not matter.

Memory doesn't seem to be the problem at the moment. Calculation of the values of the matrix takes a minute or so (for PA census tracts), which is ok. At the moment I prefer the matrix to be stored as it is because it allows the calculations of the updates of the score function to be a simple operation on one row of the matrix.

Quote
When the PL 94-171 data comes out, you might want to switch to VTD. In 2010 there were 9256 VTD. This will provide better conformance with local political subdivisions which was the key finding of the Pennsylvania redistricting cases. Perhaps VTD could be merged into smaller cells. But that may be what you did in your original effort.

Maybe I'll find a way for the algorithm to become even faster than it is now. At the moment I think that census tracts are enough for it to digest, but I'll keep VTDs in mind.

Quote
Quote
Quote
In Iowa, the cells could be the counties, since we know that whole county congressional maps of high levels of equality can be drawn. Outsiders have always speculated that the maps were computer-drawn, but they are actually done by hand. It is possible to draw reasonably compact, highly equal plans by hand. Muon2 has drawn a map that is even more equal, but not compact - but since they were working with the goal of compactness they didn't find it. Perhaps for Iowa the degree of equality could be used in the score - though there is the risk that negligible improvement in equality could override everything else. I think the Census Bureau has center of population for counties, which would be the preferred location for county-based cells. You could use 2019 estimates or project forward to the 2020 census date. If you permitted a modest range in equality (say +/- 2%, you could compare any benefit in compactness vs. equality).

Iowa-style redistricting is certainly an interesting challenge. The problem with my algorithm in this regard is that it doesn't really incorporate population inequality into the score. Instead it forbids switching cells from underpopulated districts and switching cells to overpopulated districts. This leads to the resulting plans being a bit too coarse regarding population equality.
Let's make a distinction between the plan score and your algorithm. I might be able to develop a higher scoring plan on my kitchen table than your algorithm.

Oh yes, I insist upon the distinction between score/metric and algorithm all the time, so you're right for calling me out on that. Which opens up the next question: Calculating the score for a plan that is already in the format of my python program is fast, but how to bring the plans from the kitchen table into the format. I don't want to rewrite DRA. ;-)

Quote
Let's define a tolerable plan as one in which all districts are within 5%. Then we can define an equality score as the standard deviation (relative to the ideal population). A plan that is significantly more equal (say 0.2%) defeats another plan regardless of compactness. Two plans that are about as equal (say within 0.2%) can be compared based on compactness.

That's a good idea in principle. One problem is that my score suffers from a systematic bias that tends to prefer overpopulated dense districts and underpopulated low-density districts. I found a way to counteract that in my algorithm, but that's not the question here. But yes, giving a very low tolerance band for deviations (clearly below 1%, preferably lower) would make that problem less pressing.
Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #15 on: March 06, 2021, 09:20:51 AM »

Oh yes, I insist upon the distinction between score/metric and algorithm all the time, so you're right for calling me out on that. Which opens up the next question: Calculating the score for a plan that is already in the format of my python program is fast, but how to bring the plans from the kitchen table into the format. I don't want to rewrite DRA. ;-)
How difficult would it be export plans from your program or import them to it. I'm thinking of a text file with GeoID district number pairs.

If you could export it, is trivial to bring into QGIS and produce all kinds of maps. If you could import them, then I could send you a plan, and you could score it or possibly use it as a starting point for refinement.

Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #16 on: March 06, 2021, 10:31:40 AM »

Oh yes, I insist upon the distinction between score/metric and algorithm all the time, so you're right for calling me out on that. Which opens up the next question: Calculating the score for a plan that is already in the format of my python program is fast, but how to bring the plans from the kitchen table into the format. I don't want to rewrite DRA. ;-)
How difficult would it be export plans from your program or import them to it. I'm thinking of a text file with GeoID district number pairs.

If you could export it, is trivial to bring into QGIS and produce all kinds of maps. If you could import them, then I could send you a plan, and you could score it or possibly use it as a starting point for refinement.

You mean a text file containing something like the following text?

1400000US42001030101, 9
1400000US42001030102, 13
1400000US42001030200, 2
...

(The number behind the comma being the district the census tract is assigned to.)
Both importing and exporting them would be easy.
Using QGIS is probably a good idea.
Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #17 on: March 06, 2021, 09:52:14 PM »

Oh yes, I insist upon the distinction between score/metric and algorithm all the time, so you're right for calling me out on that. Which opens up the next question: Calculating the score for a plan that is already in the format of my python program is fast, but how to bring the plans from the kitchen table into the format. I don't want to rewrite DRA. ;-)
How difficult would it be export plans from your program or import them to it. I'm thinking of a text file with GeoID district number pairs.

If you could export it, is trivial to bring into QGIS and produce all kinds of maps. If you could import them, then I could send you a plan, and you could score it or possibly use it as a starting point for refinement.

You mean a text file containing something like the following text?

1400000US42001030101, 9
1400000US42001030102, 13
1400000US42001030200, 2
...

(The number behind the comma being the district the census tract is assigned to.)
Both importing and exporting them would be easy.
Using QGIS is probably a good idea.
I was thinking of the part behind the US, but either would work. Shapefiles from the Census Bureau include both.

The 140 identifies these as Census Tracts. I think the 0000 may be a reserved field, and the US may identify the country (I think there is a program to coordinate North American GIS codes). The Tract ID is formed by three sub-fields.

42 001 030101

42 = Pennsylvania
001 = Adams County
030101 = CT 301.01

In maps and printed material leading zeroes are suppressed. The last two digits are a subfield, displayed as a suffix, that usually indicates a derivate CT. CT 301.01 and 301.02 resulted from a division of CT 301. If the last two digits are 00, they are also suppressed in maps and printed material.

What I ordinarily do is to download the shapefiles from the Census Bureau. I can drag the .shp file into the canvas of QGIS, and open the .dbf file in Excel. I might strip out some of the extraneous fields. I save this as an .xlsx file. Excel really doesn't like writing .dbf files, and I also want to preserve source files.

If I am redistricting, I add a field for district number/name, and then do a sub-table for accumulating the population of the districts, and simply assign tracts/counties, etc. by filling in the district number/name.

Then I create a .csv file from selected fields, which might be as simple as the Geoid, and the district number. I load that into QGIS, join it to my shapefile, and then create maps.

I do my editing in Excel, so am repeatedly making updates to my .xlsx file, copying to the .csv file, saving that. I then remove the .csv file in QGIS, and redo the add of the .csv file and the join.

I would eliminate the space after the comma. When you write a .csv file with Excel it does not use a space, and does not quote text unless there is a comma in the text. So a city with multiple words would appear as

123456,San Francisco,856472

Also, add a header row such as

AFFGEOID,STATE,COUNTY,TRACT,DISTRICT

Having the state ID, county ID, and tract may help locate the tract when visually looking at the .csv (though .csv files are primarily for computer interchange - and all computers comprehend ASCII).

When I was doing the Iowa maps, I output all the districts in a single file:

e.g.  RUN1,RUN2,RUN3, etc.

I would probably do the same with your output - it would be easier than working with multiple files.
Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #18 on: March 07, 2021, 12:02:15 PM »

You may be interested in this.

Migration and commuting interactions fields: a new geography with community detection algorithm?

One item I found particularly interesting was the measure of stability at paragraph 21, which indicates the percentage of runs that a node was in same area.

Perhaps you could keep a count of times each pair of nodes ended up in the same district. If this is a high percentage, the nodes could be considered close friends.

Nodes with very high number of close friends could form the core of districts.

Logged
jimrtex
Atlas Icon
*****
Posts: 11,817
Marshall Islands


Show only this user's posts in this thread
« Reply #19 on: March 10, 2021, 08:48:49 AM »

I thought of another approach.

Imagine Pennsylvania had retained 18 districts and you simply wanted to equalize population. You could take your algorithm and award a bonus for census tracts that are currently in the same district.

You could begin with random districts but would tend to return towards the current districts. There would be tension between maintaining the current districts and making them more compact while equalizing population. Generally, this would minimize the number of persons assigned to new districts.

Now repeat with 17 districts for a few districts, the bonus will not be sufficient to keep them intact, but may retain larger segments.

Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #20 on: April 07, 2021, 04:10:23 PM »

It has been a month since my last post on this topic, but I haven't forgotten the computational redistricting project in the meantime. Non-atlas things needed some time and energy. When I worked on the project, I was mostly working on North Carolina and how to nudge the algorithm into drawing a black VRA district in the Northeast. I made some progress, more on this in the next post.

You may be interested in this.

Migration and commuting interactions fields: a new geography with community detection algorithm?

One item I found particularly interesting was the measure of stability at paragraph 21, which indicates the percentage of runs that a node was in same area.

Perhaps you could keep a count of times each pair of nodes ended up in the same district. If this is a high percentage, the nodes could be considered close friends.

Nodes with very high number of close friends could form the core of districts.



The Louvain method used in the article seems to be a clustering method that has some similarities to the method I employ. The main differences that I see between the article's methodology and mine are:
- No constraints requesting equal size
- Metric based on social features like commuting and migration instead of pure geographic distance. Arguably this could even be preferable in identifying real-life CoI, but it's too difficult to obtain all the data in an appropriate format for my purposes, so I'll leave it at that.
- Even if there was a purely geographic metric approximating the commuting/migration based metric, it would be fundamentally different from mine, because it is bounded from below by 0, whereas the logarithm that I use, decreases by the same amount every time that the distance is doubled. Because of this the metric is probably also not scale-free.

Counting the number of times that nodes end up in the same district as a means for drawing districts feels like an unnecessary loop to me. By that I don't mean that observing the stability of districts on different runs is a bad idea, on the contrary, but I think that it's already an interesting thing to watch in its own right.

I thought of another approach.

Imagine Pennsylvania had retained 18 districts and you simply wanted to equalize population. You could take your algorithm and award a bonus for census tracts that are currently in the same district.

You could begin with random districts but would tend to return towards the current districts. There would be tension between maintaining the current districts and making them more compact while equalizing population. Generally, this would minimize the number of persons assigned to new districts.

Now repeat with 17 districts for a few districts, the bonus will not be sufficient to keep them intact, but may retain larger segments.

Good idea. The bonus would work in the same way that it does for counties.
Logged
palandio
Jr. Member
***
Posts: 1,026


Show only this user's posts in this thread
« Reply #21 on: April 07, 2021, 04:38:34 PM »


The main challenge was to force the algorithm to draw a VRA district. Even now it often doesn't, but then I can interrupt the run. But even if it does draw a VRA district, due to the additional constraints, the VRA district often happens to include stray precincts. The above map comes from the combination of two different formulation of the constraint used during different phases of the run and if you look closely you can still spot two stray precincts in Goldsboro and Kinston, but before that I got far more stray precincts and often even an exclave of the Northeastern VRA district in Anson county.

During numerous runs I saw the following patterns:
- A triad seat often emerged early in the run, but it was never stable and always got replaced by a Greensboro seat and a Winston-Salem seat like you can see in the map above. There might be some exotic configuration with a stable triad seat, but I doubt it.
- Greater Charlotte is the area with the greatest variation of general setups. Often enough I get a stable seat that stretches from Gaston county to Union county. Sometimes a Charlotte district, a Western suburbs district and an Eastern suburbs district with the Hickory district extending all the way to the north. Sometimes Charlotte is divided between a Gaston district and a Union district.
- Often enough Durham forms a district with the western part of Wake county.
Logged
Pages: [1]  
« previous next »
Jump to:  


Login with username, password and session length

Terms of Service - DMCA Agent and Policy - Privacy Policy and Cookies

Powered by SMF 1.1.21 | SMF © 2015, Simple Machines

Page created in 0.089 seconds with 13 queries.