Household surveys
We accessed geolocated household surveys involving the complete enumeration of 926 microcensus clusters of approximately three settled hectares in five provinces in the western part of the DRC. The data were collected during two rounds of household surveys led by the UCLA-DRC Health Research and Training Program based at the University of California, Los Angeles Fielding School of Public Health and the Kinshasa School of Public Health (KSPH)30. The first round of surveys was carried out between May and July 2017 in the provinces of Kinshasa, Kwango, Kwilu, and Mai-Ndombe using random sampling, whereas the second round was carried out between October and December 2018 in the provinces of Kinshasa and Kongo Central using population-weighted sampling16. The surveys were designed for bottom-up population modeling in the provinces mentioned above. Given the time and resources needed to travel to and fully enumerate the 926 clusters, only essential demographic data (e.g., household size and age and sex characteristics) were collected in the surveys. All survey data were anonymous and purged of any personally identifiable information before being received for analysis.
In both surveys, seed locations (i.e., 100 m grid cells) were first selected, and cluster boundaries were subsequently manually delineated around these locations to include approximately three settled hectares with similar settlement characteristics (e.g., building size and shape) assessed from satellite imagery. We accessed the sampling weights for the second round of surveys and assessed their statistical distribution to identify outliers associated with known uncertainties in the gridded population data used in the sampling16,18. To limit the effects of these outliers, we truncated the sampling weights at the 90th percentile of the statistical distribution. We retrieved population totals for the clusters from the population counts recorded within each household where informed consent was obtained (n = 79,126) and imputed population in households with a nonresponse (n = 629) based on the mean population per household within the same cluster. We also retrieved population totals for standardized age (i.e., under 1-year-old, 5-year groups from 1 to 80, and above 80 years old) and sex (i.e., male and female) groups within each province by aggregating individual survey records.
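The weight truncation and nonresponse imputation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy weights, and the use of `None` to mark a nonresponding household are assumptions, and the percentile interpolation method is one of several reasonable choices.

```python
from statistics import mean, quantiles


def truncate_weights(weights, pct=90):
    """Cap sampling weights at the given percentile of their distribution."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cap = quantiles(weights, n=100, method="inclusive")[pct - 1]
    return [min(w, cap) for w in weights]


def cluster_total(household_counts):
    """Sum household populations in one cluster.

    None marks a nonresponse (assumed encoding); it is replaced by the
    mean population of the responding households in the same cluster.
    """
    observed = [c for c in household_counts if c is not None]
    fill = mean(observed)
    return sum(fill if c is None else c for c in household_counts)


# Hypothetical sampling weights with one extreme outlier.
weights = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.4, 1.1, 25.0]
truncated = truncate_weights(weights)

# Hypothetical cluster with five households, one nonresponse.
total = cluster_total([4, 6, None, 5, 5])
```

Capping rather than dropping the outlying weights keeps every cluster in the sample while limiting the influence of any single one.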
Ethical compliance for the data collection was approved by the institutional review boards at the University of Kinshasa School of Public Health (KSPH) Ethics Committee and at the University of California Los Angeles Institutional Review Board (UCLA IRB). Ethical approval for the data analysis was granted by the University of Southampton Ethics Committee.
Building footprints
We accessed building footprint data automatically extracted by Ecopia.AI in 2019 using satellite imagery provided by Maxar Technologies within the DRC31. The imagery used for feature extraction provides the highest quality (i.e., less than 5% cloud coverage and 0.3% coverage gaps) and the most recent (i.e., on average newer than 2017) representation of man-made buildings visible on the ground, including both residential and non-residential buildings. However, outdated satellite imagery (i.e., dating back to 2009) and other contextual factors (e.g., clouds, smoke due to slash-and-burn agriculture, and forest canopy coverage) may affect the automated extraction of building footprints in the most dynamic and remote settlement types. Given the robust quality control process developed by Ecopia.AI, the building footprints are considered to provide the most accurate and up-to-date approximation of the spatial distribution of populations across the five provinces. Some other building footprint datasets (e.g., Microsoft Building Footprint Data) have recently been released, but their coverage of the DRC is not yet available or optimal.
We used the building footprints to derive morphological and topological attributes, such as area, perimeter, number of nodes, and distance to the nearest feature22,23. We summarized these attributes within the microcensus clusters and grid cells of approximately 100 m comprising the five provinces using basic summary statistics, such as the sum, mean, and coefficient of variation22. We also produced the same summary statistics for focal windows of approximately 500 m, 1 km, and 2 km to reflect contextual characteristics23. We allocated the microcensus clusters and the grid cells to urban and rural settlements using an existing morphological classification derived from the same building footprint data by the Center for International Earth Science Information Network (CIESIN)32. We labeled the original built-up area class as urban settlement and merged the original classes small settlement area (i.e., representing rural settlements) and hamlet (i.e., isolated rural settlements) into a class labeled rural settlement. Urban settlements are characterized by contours with an area greater than or equal to 40 building footprint ha with a building density of at least 13 building footprints across it, whereas rural settlements include the remaining part of the study area32.
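As a rough illustration of the per-cluster summaries mentioned above, the sketch below computes the count, sum, mean, and coefficient of variation of footprint area. The data layout (`(cluster_id, area_ha)` tuples), the function name, and the use of the population standard deviation in the coefficient of variation are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(buildings):
    """Return per-cluster summary statistics of building footprint area."""
    by_cluster = defaultdict(list)
    for cluster_id, area_ha in buildings:
        by_cluster[cluster_id].append(area_ha)

    out = {}
    for cid, areas in by_cluster.items():
        m = mean(areas)
        out[cid] = {
            "count": len(areas),
            "sum": sum(areas),
            "mean": m,
            # Coefficient of variation: standard deviation relative to the mean.
            "cv": pstdev(areas) / m if m else float("nan"),
        }
    return out


# Hypothetical footprints: (cluster identifier, footprint area in ha).
buildings = [
    ("c1", 0.010), ("c1", 0.020), ("c1", 0.030),
    ("c2", 0.015), ("c2", 0.015),
]
stats = summarize(buildings)
```

The same function could be applied to grid cells or focal windows by swapping the grouping key.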
Administrative boundaries
We accessed administrative boundaries provided by the Bureau Central du Recensement (BCR), the administrative body responsible for the census implementation in the DRC33. The boundaries comprised the administrative level 0 (i.e., country), level 1 (i.e., provinces), level 2 (i.e., territories and cities), and level 3 (i.e., sectors/chiefdoms and municipalities). At the time of this study, the administrative boundaries were being consolidated, and level 3 boundaries were only available for the city of Kinshasa. We first derived the spatial extent of the provinces from the level 1 boundaries and subsequently created the sub-provincial regions by combining level 2 and level 3 boundaries. In doing so, we merged the level 3 boundaries of the 24 municipalities comprising the city of Kinshasa into 9 contiguous groups of municipalities with similar settlement characteristics as reported in the Strategic Orientation Plan for the Agglomeration of Kinshasa34. For instance, the boundaries of the municipalities of Bandalungwa, Kintambo, Ngaliema, and Selembao were merged into a group corresponding to the western expansion of the city. This ad-hoc grouping of municipalities ensured that every sub-provincial region would contain at least one microcensus cluster to estimate random intercepts in the population model. Lastly, we produced gridded datasets with a resolution of approximately 100 m with unique identifiers for each province and sub-provincial region and subsequently allocated the microcensus clusters to a single province and a sub-provincial region.
Covariate processing and selection
We first constrained the extent of the clusters using the building footprints located within a radius of approximately 50 m from the surveyed households to exclude areas that were not surveyed because of accessibility constraints. We then derived morphological and topological attribute summaries from the building footprints and extracted additional summaries from standard gridded datasets used in the study of population distributions, for instance, temperature, precipitation, land use, and night-time light intensity35. Model covariates were selected by assessing relationships between log-population densities (people/building footprint ha) and the attribute summaries across the clusters using scatterplots and Pearson correlations. This procedure enabled us to retain the five covariates with the strongest linear association to population densities: (1) building count (count of buildings), (2) average building area (in ha), (3) average building perimeter (in m), (4) average building proximity or the inverse of the distance to the nearest building (in m), and (5) average building focal count (average count of buildings within a focal window of approximately 2 km). To avoid multicollinearity, we assessed Pearson correlations between the five covariates and subsequently discarded average building perimeter because it was strongly correlated with average building area. Building count was also discarded to avoid potential data circularity because it was used in other parts of the model. The selected covariates were finally scaled based on the mean and standard deviation computed at the grid cell level across the study area.
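The collinearity screening and scaling steps can be sketched as below. This is only an illustration under stated assumptions: the 0.8 correlation threshold, the greedy keep-first rule, the covariate names, and all values are hypothetical, and the source does not state a numeric cut-off.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(covariates, threshold=0.8):
    """Keep a covariate only if |r| with every already-kept one is below threshold."""
    kept = []
    for name, values in covariates.items():
        if all(abs(pearson(values, covariates[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def scale(values, ref_values):
    """Standardize values with the mean and SD of a reference set (e.g., grid cells)."""
    m, s = mean(ref_values), pstdev(ref_values)
    return [(v - m) / s for v in values]


# Hypothetical cluster-level covariates.
covs = {
    "mean_area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mean_perimeter": [2.1, 3.9, 6.2, 8.0, 9.9],   # nearly collinear with area
    "mean_proximity": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_collinear(covs)

# Scale cluster values against hypothetical grid-cell values.
scaled = scale([2.0, 4.0], ref_values=[0.0, 2.0, 4.0])
```

Scaling against grid-cell statistics rather than cluster statistics keeps the covariates on the same scale at fitting and prediction time.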
Data processing and covariate selection were carried out in R version 4.0.2 (ref. 36) using the R packages raster37 version 3.0 and sf38 version 0.7.
Population model
We modeled population totals by extending an existing hierarchical Bayesian modeling framework for population estimation6. The hierarchical modeling framework offers great flexibility and adaptability to complex input data, such as household survey data, while accurately reflecting model uncertainty through Bayesian credibility intervals. Model uncertainty is associated with the inability to capture features in the input data, for instance, observational error or limited sample size. Equation (1) models the total number of people \(N_i\) as a Poisson process, where \(D_i\) is the population density (people/building footprint ha) and \(A_i\) is the total area of building footprints (ha) derived from the building footprints within each microcensus cluster \(i\). The use of building footprints provides a valuable additional source of information that constrains the estimation of population totals within a reasonable range.
$$N_{i}\sim \mathrm{Poisson}\left(D_{i}A_{i}\right)$$
(1)
Equation (2) models \(D_i\) as a log-normal process to relax the assumptions of the Poisson distribution, where \(\bar{D}_{i}\) is the expected population density on a log-scale and \(\tau_{t,p,i}\) is a hierarchical precision term estimated by settlement type \(t\) and province \(p\) for each cluster \(i\).
$$D_{i}\sim \mathrm{LogNormal}\left(\bar{D}_{i},\,\tau_{t,p,i}\right)$$
(2)
Equation (3) defines the precision term \(\tau_{t,p,i}\) based on a hierarchical estimate of precision \(\tau_{t,p}\) and the model weights \(v_i\) (ref. 19). \(\tau_{t,p}\) is estimated hierarchically by settlement type \(t\) and province \(p\) using uninformative priors on the mean \(\mu_t\) and the variance \(\sigma_t\) terms, which are modeled by a normal and uniform distribution, respectively.
$$\tau_{t,p,i}=\sqrt{\frac{1}{v_{i}\,\tau_{t,p}^{-2}}}$$
(3)
$$\tau_{t,p}\sim \mathrm{Half}{-}\mathrm{Normal}\left(\mu_{t,p},\,\sigma_{t,p}\right)$$
$$\mu_{t,p}\sim \mathrm{Half}{-}\mathrm{Normal}\left(\mu_{t},\,\sigma_{t}\right)$$
$$\sigma_{t,p}\sim \mathrm{Uniform}\left(0,\,\sigma_{t}\right)$$
$$\mu_{t}\sim \mathrm{Normal}\left(0,\,1000\right)$$
$$\sigma_{t}\sim \mathrm{Uniform}\left(0,\,1000\right)$$
Equation (4) defines the model weight \(v_i\) as the inverse of the sampling weight \(w_i\) used to select cluster \(i\) in the second round of household surveys. The sum of \(w_i\) is used to proportionally impute \(w_i\) for the clusters that were selected randomly during the first round of household surveys. The \(v_i\) are then rescaled to sum to one across all the clusters \(I\).
$$v_{i}=\frac{w_{i}^{-1}}{\sum_{i=1}^{I}w_{i}^{-1}}$$
(4)
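Equation (4) amounts to normalizing the inverse sampling weights. A minimal sketch, assuming every cluster already has a sampling weight (the imputation for first-round clusters is not reproduced here) and using made-up weights:

```python
def model_weights(sampling_weights):
    """Eq. (4): inverse sampling weights rescaled to sum to one."""
    inverse = [1.0 / w for w in sampling_weights]
    total = sum(inverse)
    return [x / total for x in inverse]


# Hypothetical sampling weights w_i for four clusters.
w = [1.0, 2.0, 4.0, 4.0]
v = model_weights(w)
```

Clusters with a large sampling weight therefore receive a small model weight, and vice versa.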
As the estimate of precision \(\tau_{t,p,i}\) cannot be derived, and therefore cannot be adopted for posterior model predictions, in areas where the sampling weights \(w_i\) are not available, Eq. (5) determines a hierarchical estimate of precision \(\hat{\tau}_{t,p}\) from a weighted average of \(\tau_{t,p,i}\), where \(I_{t,p}\) is the number of clusters \(i\) within settlement type \(t\) and province \(p\).
$$\hat{\tau}_{t,p}=\frac{\sum_{i=1}^{I_{t,p}}\tau_{t,p,i}\sqrt{v_{i}}}{\sum_{i=1}^{I_{t,p}}\sqrt{v_{i}}}$$
(5)
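To make Eqs. (3) and (5) concrete, the sketch below evaluates both for a handful of clusters. The numeric values of \(\tau_{t,p}\) and \(v_i\) are invented for illustration; in the model itself \(\tau_{t,p}\) is a posterior quantity, not a fixed number.

```python
from math import sqrt


def cluster_precision(tau_tp, v_i):
    """Eq. (3): cluster-level term from the pooled term and the model weight."""
    return sqrt(1.0 / (v_i * tau_tp ** -2))


def pooled_precision(tau_tpi, v):
    """Eq. (5): average of the cluster-level terms, weighted by sqrt(v_i)."""
    numerator = sum(t * sqrt(v_i) for t, v_i in zip(tau_tpi, v))
    denominator = sum(sqrt(v_i) for v_i in v)
    return numerator / denominator


tau_tp = 0.5                 # hypothetical pooled term for one (t, p) stratum
v = [0.25, 0.25, 0.5]        # hypothetical model weights in that stratum
tau_i = [cluster_precision(tau_tp, v_i) for v_i in v]
tau_hat = pooled_precision(tau_i, v)
```

Because Eq. (3) simplifies to \(\tau_{t,p}/\sqrt{v_i}\), each product \(\tau_{t,p,i}\sqrt{v_i}\) in the numerator of Eq. (5) equals \(\tau_{t,p}\), so the weighted average always lies between the smallest and largest cluster-level terms.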
Equation (6) uses the precision estimate \(\hat{\tau}_{t,p}\) for posterior model predictions by altering Eq. (2).
$$\hat{D}_{i}\sim \mathrm{LogNormal}\left(\bar{D}_{i},\,\hat{\tau}_{t,p}\right)$$
(6)
Equation (7) models the expected population density \(\bar{D}_{i}\) using a linear regression with random intercept \(\alpha_{t,p,l}\) estimated by settlement type \(t\), province \(p\), and local area \(l\) and \(K\) covariates \(x_k\) with random effects \(\beta_{k,t}\) estimated by settlement type \(t\).
$$\bar{D}_{i}=\alpha_{t,p,l}+\sum_{k=1}^{K}\beta_{k,t}\,x_{k,i}$$
(7)
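The linear predictor of Eq. (7) and its link to Eq. (1) can be illustrated with fixed numbers. This is a deterministic sketch only: the intercept, coefficients, covariate values, and footprint area are hypothetical, and the example uses the median of the log-normal density, \(\exp(\bar{D}_i)\), instead of sampling from the posterior.

```python
from math import exp


def expected_log_density(alpha, betas, x):
    """Eq. (7): intercept plus the sum of coefficient-covariate products."""
    return alpha + sum(b * x_k for b, x_k in zip(betas, x))


alpha = 5.0            # hypothetical intercept for one (t, p, l) combination
betas = [-0.2, 0.1]    # hypothetical effects for K = 2 scaled covariates
x_i = [1.0, 2.0]       # hypothetical scaled covariate values for cluster i

log_density = expected_log_density(alpha, betas, x_i)

# Median density (people per building footprint ha) times footprint area (ha)
# gives the Poisson rate of Eq. (1) evaluated at the median density.
area_ha = 0.5
median_count = exp(log_density) * area_ha
```

In the fitted model these quantities are full posterior distributions, so predicted counts come with credible intervals, not single values.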
Equation (8) models the hierarchical intercept \(\alpha_{t,p,l}\) for a local area \(l\) belonging to a settlement type \(t\) and province \(p\) as a nested hierarchy with uninformative priors on the mean \(\xi_{t,p}\) and variance \(\nu_{t,p}\) terms. These are modeled using a normal and uniform distribution, respectively.
$$\alpha_{t,p,l}\sim \mathrm{Normal}\left(\xi_{t,p},\,\nu_{t,p}\right)$$
(8)
$$\xi_{t,p}\sim \mathrm{Normal}\left(\xi_{t},\,\nu_{t}\right)$$
$$\nu_{t,p}\sim \mathrm{Uniform}\left(0,\,\nu_{t}\right)$$
$$\xi_{t}\sim \mathrm{Normal}\left(0,\,1000\right)$$
$$\nu_{t}\sim \mathrm{Uniform}\left(0,\,1000\right)$$
Equation (9) models the random effects \(\beta_{k,t}\) for each covariate \(k\) independently for each settlement type \(t\) with uninformative priors on the mean \(\rho_k\) and variance \(\omega_k\) terms, which follow a normal and uniform distribution, respectively.
$$\beta_{k,t}\sim \mathrm{Normal}\left(\rho_{k},\,\omega_{k}\right)$$
(9)
$$\rho_{k}\sim \mathrm{Normal}\left(0,\,1000\right)$$
$$\omega_{k}\sim \mathrm{Uniform}\left(0,\,1000\right)$$
For each covariate \(k\), random effects \(\beta_{k,t}\) with similar estimated posterior distributions across settlement types \(t\) are converted into a fixed effect \(\beta_k\) modeled with an uninformative normal distribution (Eq. (10)).
$$\beta_{k}\sim \mathrm{Normal}\left(0,\,1000\right)$$
(10)
Age and sex structure model
Age and sex structures are modeled as a Dirichlet-multinomial process21. This distribution is commonly used to model compositional count data, in other words, the count of observations (e.g., people) belonging to mutually exclusive categories (e.g., age and sex groups). Equation (11) models the observed count of people \(N_{g,p}\) within an age and sex group \(g\) and a province \(p\) as a Multinomial process, where \(\pi_{g,p}\) is the relative proportion of the age and sex group and \(N_p\) the observed population within the province \(p\). \(g\) includes \(G\) mutually exclusive age and sex groups: two sex groups (i.e., male and female), each subdivided into 18 age groups (i.e., under 1-year-old, 1 to 4 years old, 5-year groups from 5 to 80, and above 80 years old). Age and sex proportions were not modeled at the sub-provincial level because reduced sample sizes could result in spurious estimates for the smallest groups9,21.
$$N_{g,p}\sim \mathrm{Multinomial}\left(N_{p},\,\pi_{g,p}\right)$$
(11)
Because the sum of \(\pi_{g,p}\) within each \(p\) is constrained to one, Eq. (12) uses an uninformative Dirichlet distribution as a conjugate prior for \(\pi_{g,p}\), where \(\chi^{G}\) is a constant numerical vector with values \(1/G\) and of length \(G\) (ref. 13).
$$\pi_{g,p}\sim \mathrm{Dirichlet}\left(\chi^{G}\right)$$
(12)
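Because the Dirichlet prior is conjugate to the multinomial likelihood, the posterior for the proportions is again a Dirichlet whose parameters are the prior values plus the observed counts. The sketch below computes the resulting posterior mean proportions in closed form; it is an illustration of that property with invented counts and only four groups, not a reproduction of the MCMC estimation used in the study.

```python
def posterior_mean_proportions(counts):
    """Posterior mean of Dirichlet(1/G, ..., 1/G) updated with multinomial counts."""
    n_groups = len(counts)
    prior = 1.0 / n_groups                      # each element of the vector chi^G
    posterior = [c + prior for c in counts]     # conjugate update: prior + counts
    total = sum(posterior)
    return [a / total for a in posterior]


# Hypothetical counts for G = 4 age and sex groups in one province.
counts = [30, 50, 0, 20]
props = posterior_mean_proportions(counts)
```

A group with no observed individuals still receives a small positive proportion, which is one reason a weak prior is useful for sparsely populated groups.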
Model fit and diagnostics
We estimated the model with MCMC methods in JAGS 4.3.0 (ref. 39) using the R package runjags40 version 2.0.4. The convergence of three MCMC chains was assessed using the Gelman-Rubin statistic, and values less than 1.1 were interpreted as indicating convergence41, while the model residuals were examined for spatial autocorrelation using semivariograms and Moran’s I statistics. We examined model fit in- and out-of-sample using 10-fold cross-validation, where the model was fit ten times, each time withholding a random 10% of microcensus clusters until all had been held out once. To assess model fit for age and sex proportions, we held out 10% of the clusters for each province and assessed the combined posterior distribution for each demographic group. For in- and out-of-sample predicted population sizes, densities, and province-level age and sex proportions, we evaluated bias (i.e., the mean of residuals, defined as mean posterior predictions minus observed values), imprecision (i.e., the standard deviation of residuals), inaccuracy (i.e., the mean of absolute residuals), \(R^2\) values (i.e., the squared Pearson correlation coefficient among the residuals), and the proportion of observations falling within the 95% prediction intervals. We also computed bias, imprecision, and inaccuracy using standardized residuals (i.e., residuals divided by the mean posterior predictions)41.
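The residual-based fit metrics defined above can be sketched as follows. The function name, the toy values, and the choice of the sample standard deviation for imprecision are assumptions; the definitions of residuals, bias, inaccuracy, and interval coverage follow the text.

```python
from statistics import mean, stdev


def fit_metrics(observed, predicted, lower, upper, standardized=False):
    """Bias, imprecision, inaccuracy, and 95% interval coverage.

    Residuals are mean posterior predictions minus observed values; with
    standardized=True they are divided by the mean posterior predictions.
    """
    residuals = [p - o for p, o in zip(predicted, observed)]
    if standardized:
        residuals = [r / p for r, p in zip(residuals, predicted)]
    inside = [lo <= o <= hi for o, lo, hi in zip(observed, lower, upper)]
    return {
        "bias": mean(residuals),
        "imprecision": stdev(residuals),
        "inaccuracy": mean(abs(r) for r in residuals),
        "coverage": sum(inside) / len(inside),
    }


# Hypothetical cluster totals, predictions, and 95% prediction intervals.
observed = [100, 200, 300, 400]
predicted = [110, 190, 320, 380]
lower = [90, 205, 250, 300]
upper = [130, 260, 350, 450]

metrics = fit_metrics(observed, predicted, lower, upper)
std_metrics = fit_metrics(observed, predicted, lower, upper, standardized=True)
```

In a 10-fold cross-validation, the same function would be applied to the held-out clusters of each fold and the results pooled.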
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.