Health Reports
Accuracy of matching residential postal codes to census geography

by Lauren Pinault, Saeeda Khan and Michael Tjepkema

Release date: June 17, 2020

DOI: https://www.doi.org/10.25318/82-003-x202000300001-eng

In the majority of Canadian health survey or administrative datasets, the geographic information available to researchers is limited to a residential postal code rather than a full street address, often for confidentiality reasons. Postal codes are six-character alphanumeric codes created by Canada Post Corporation to sort and deliver mail.Note 1 Since postal codes do not always reflect well-defined, discrete and homogeneous spatial units, there is some uncertainty in using them to assign geographic attributes, such as exposures to environmental hazards. Postal code geography also does not always correspond with census geography, since many postal codes cross census boundaries.Note 2 Therefore, postal codes may be linked to one or more units of census geography.

Postal codes are used by a broad range of health research disciplines to assign contextual covariates to subjects. Neighbourhood estimates of socioeconomic deprivation and other demographic variables have been used to predict behavioural problems in childrenNote 3 and higher rates of depression in urban areas.Note 4 The Public Health Agency of Canada has created area-based indicators to capture health status and health determinants to highlight regional health inequalities in Canada.Note 5 In the absence of individual-level socioeconomic data in a health dataset, researchers have also used area-based estimates as a proxy for personal income to adjust for model confounding. For example, neighbourhood income was used to adjust models of diabetes prevalence among immigrants to Ontario,Note 6 in a study of patients with rheumatoid arthritis who received specialized services,Note 7 and in a study of people who sought treatment for HIV/AIDS in British Columbia.Note 8 Postal codes have been used alongside inconsistent street address records in a coroners’ dataset to improve the geolocation of drug overdose sites, therefore allowing researchers to evaluate the efficacy of a supervised injection site in Vancouver.Note 9

In environmental health research, geography based on postal code is used in geographic information systems to link survey and administrative health datasets to environmental data from a range of sources. For example, for the purpose of cross-sectional analysis, obesity and diabetes data from the Canadian Community Health Survey have been attached to area-based population counts and residential density from the census; availability of walkable destinations from government datasets; and indicators of land-use mix, street connectivity, and the density of fast-food restaurants and grocery stores from proprietary datasets.Note 10Note 11 In longitudinal mortality studies, large population-based health cohorts have been assigned estimates of ambient air pollution, greenness and blue space based on coordinates derived from postal codes reported on tax files.Note 12Note 13Note 14Note 15

Statistics Canada produces two products that attach six-character postal codes to units of census geography: (1) the Postal Code Conversion File (PCCF),Note 2 which provides all possible matches between postal codes and census geography, and (2) the Postal Code Conversion File Plus (PCCF+),Note 16 which uses both population weighting and random allocation to assign specific census geography to individuals in a dataset. Essentially, the PCCF+ uses population weighting to inform the selection of one of the matches provided on the PCCF file for each record. A secondary function of the PCCF+ is to assign a representative point of latitude and longitude to each individual; the positional accuracy of this assignment across Canada has been assessed in a previous paper.Note 17

Although hundreds of health studies have used the PCCF or PCCF+ for geocoding,Note 18 the accuracy of assigning census geography based on postal codes across all of Canada remains poorly studied. One study in Nova Scotia estimated positional error in assigning census geography by comparing the PCCF+ with the Nova Scotia Civic Address File. It found lower rates of error in urban areas than in rural areas, although some misclassification occurred in all areas of the province.Note 19

This study describes the characteristics of residential postal codes of the Canadian population using the 2016 Census and determines how frequently these postal codes are matched to one or more dissemination areas (DAs), a unit of census geography. DAs were assessed because they are small, relatively stable and discrete areas that represent approximately 400 to 700 people.Note 20 DAs are the smallest geographic unit for which census data are disseminated,Note 20 and they are the unit most frequently used in health research to attach neighbourhood measures or environmental exposures based on geography.Note 3Note 6 Since a geocoding error could be introduced if a postal code matches to more than one DA, the number of possible matches to DAs was used here as an indicator of geocoding accuracy. For comparison, the percentage of the population matched to single census tracts (CTs), census subdivisions (CSDs) and census divisions (CDs) was also determined.

Methods

How the Postal Code Conversion File Plus geocodes to dissemination areas

The PCCF includes all possible matches of postal codes to census geography (block face, dissemination block and DA). The PCCF+ program selects one of these matches for each postal code in a user-supplied health dataset using a stepwise process (Figure 1). First, rural postal codes (denoted by a “0” in the second character position) and those with delivery mode types (DMTs) that indicate rural route delivery (DMT=H), general delivery (DMT=J), suburban service (DMT=T), post office boxes (DMT=K) or retired routes (DMT=Z) are matched to a subset of the weighted conversion file (WCF) based on probabilities from census population weights. The WCF is a dataset that combines postal codes from the PCCF with population counts from the census at the DA level. It is used to generate population weights for each combination of postal code and DA. Although postal codes in rural areas tend to have lower geocoding accuracy because of their larger spatial extent (i.e., residences for a given postal code are spread over a larger area), the population weighting technique allows an individual-level dataset to be matched to census geography in a way that is consistent with the spatial distribution of the overall population.

PCCF records can include postal codes matched to one (unique) or many (duplicate) records of census geography. During the second step, the remaining postal codes that correspond with unique PCCF records (mostly urban postal codes) are matched to census geography. Third, the remaining postal codes that are matched to duplicate PCCF records are randomly matched with equal probability to all possible PCCF records. However, any remaining unmatched postal codes that have a DMT of rural, H, J, K or T are flagged during this step to be processed based on the first three characters of the postal code, called forward sortation areas (FSAs), in the final steps. In these final steps, the PCCF+ attempts to impute census geography based on partial postal codes (using the first five, four, three or two characters iteratively) using census population weights (Figure 1). Partial postal codes are considered where provided, or when full postal codes do not match to the PCCF.
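The stepwise process described above can be illustrated with a minimal sketch. It is not the PCCF+ program itself: the table layouts, the DA codes, the postal codes and the function name assign_da are hypothetical, and the handling of flagged postal codes is simplified to a single fallback on partial postal codes.

```python
import random
from collections import defaultdict

# Hypothetical toy tables; the real PCCF and WCF layouts differ.
# PCCF-style lookup: postal code -> candidate DA codes.
PCCF = {
    "K1A0B1": ["35060001"],              # unique record
    "M5V2T6": ["35200010", "35200011"],  # duplicate records
}
# WCF-style lookup: postal code -> {DA code: census population weight}.
WCF = {
    "B0J1A0": {"12060101": 300, "12060102": 100},
}
WEIGHTED_DMTS = {"H", "J", "T", "K", "Z"}


def weighted_draw(weights_by_da, rng):
    """Draw one DA with probability proportional to its population weight."""
    das, weights = zip(*weights_by_da.items())
    return rng.choices(das, weights=weights, k=1)[0]


def assign_da(postal_code, dmt, rng):
    """Return (DA code or None, name of the step that produced it)."""
    rural = postal_code[1] == "0"
    # Step 1: rural codes and selected DMTs use population weights.
    if (rural or dmt in WEIGHTED_DMTS) and postal_code in WCF:
        return weighted_draw(WCF[postal_code], rng), "weighted"
    candidates = PCCF.get(postal_code, [])
    # Step 2: a unique PCCF record is taken as is.
    if len(candidates) == 1:
        return candidates[0], "unique"
    # Step 3: duplicate PCCF records are drawn with equal probability.
    if len(candidates) > 1:
        return rng.choice(candidates), "random"
    # Final steps: impute from partial postal codes (5, 4, 3, then 2 characters).
    for n in (5, 4, 3, 2):
        pooled = defaultdict(int)
        for pc, weights_by_da in WCF.items():
            if pc.startswith(postal_code[:n]):
                for da, pop in weights_by_da.items():
                    pooled[da] += pop
        if pooled:
            return weighted_draw(pooled, rng), f"partial-{n}"
    return None, "unmatched"
```

Passing a seeded random.Random instance as rng makes the allocation reproducible, which is useful when the same health dataset has to be geocoded more than once.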

Compared with the PCCF+ stepwise process, the original PCCF single-link indicator (SLI) selects one record for each postal code—generally the one with the largest number of dwellings.Note 2 Using the PCCF SLI, individuals in a health dataset would be matched only to DAs that are represented by the SLI. This results in clusters of high population areas surrounded by many DAs that are not matched to postal codes (Figure 2a). On the other hand, the PCCF+ allocates the rural population to DAs based on the population weights and allows users to ensure that the overall health dataset they are using is spatially distributed in a way that more closely resembles the actual population distribution (Figure 2b).
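The contrast between the two approaches can be seen in a small simulation. The sketch below uses invented dwelling and population counts for three hypothetical DAs that share one rural postal code; the names sli_pick and weighted_pick are illustrative and are not part of either product.

```python
import random
from collections import Counter

# Hypothetical candidate DAs for a single rural postal code.
candidates = {
    "DA_town":  {"dwellings": 400, "population": 900},
    "DA_north": {"dwellings": 250, "population": 600},
    "DA_south": {"dwellings": 200, "population": 500},
}


def sli_pick(cands):
    """SLI-style choice: always the candidate with the most dwellings."""
    return max(cands, key=lambda da: cands[da]["dwellings"])


def weighted_pick(cands, rng):
    """PCCF+-style choice: random draw proportional to population."""
    das = list(cands)
    weights = [cands[da]["population"] for da in das]
    return rng.choices(das, weights=weights, k=1)[0]


rng = random.Random(42)
n_people = 1000
sli_counts = Counter(sli_pick(candidates) for _ in range(n_people))
weighted_counts = Counter(weighted_pick(candidates, rng) for _ in range(n_people))

print(sli_counts)       # every record lands in DA_town
print(weighted_counts)  # records are spread across all three DAs
```

Under the SLI-style rule all 1,000 simulated records fall in one DA, whereas the weighted draw spreads them roughly in proportion to population, which is the pattern contrasted in Figures 2a and 2b.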

Descriptive statistics—methods

Postal codes for the entire population of Canada were obtained from the 2016 Census short-form questionnaire. All postal codes for the population of Canada were processed with PCCF+ version 7B to obtain the following variables: province or territory, DMT, population centre or rural area type, and census metropolitan area (CMA). PCCF+ version 7B uses postal code records as of November 2018.Note 16

DMT indicates the method of mail delivery attributed to the postal code. It is useful to consider DMT in geocoding accuracy since previous studies have indicated that accuracy varies substantially by DMT category.Note 17 The largest groupings of DMT categories are A (urban street address), B (urban apartment building) and rural postal code (no letter, but assigned a DMT of “W” by the PCCF+). DMTs E, G and M indicate business addresses or large volume receivers, and DMT K indicates post office boxes.Note 1 DMT Z indicates a postal code that has since been retired by Canada Post Corporation.Note 1 These were included since postal codes that appear on the 2016 Census may have been retired since that date.
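As an illustration, the DMT groupings described in this article can be held in a simple lookup. This is a sketch only: the function name describe_dmt and the returned field names are hypothetical, and the labels are paraphrased from the text rather than taken from PCCF+ output.

```python
# DMT labels paraphrased from the text. "W" is the code the PCCF+ assigns to
# rural postal codes, which carry no DMT letter of their own.
DMT_LABELS = {
    "A": "urban street address",
    "B": "urban apartment building",
    "E": "business or large volume receiver",
    "G": "business or large volume receiver",
    "M": "business or large volume receiver",
    "H": "rural route",
    "J": "general delivery",
    "K": "post office box",
    "T": "suburban service",
    "W": "rural postal code",
    "Z": "retired postal code",
}
# DMTs that the PCCF+ flags as probably non-residential.
NON_RESIDENTIAL = {"E", "G", "M"}


def describe_dmt(postal_code, dmt=None):
    """Summarize the DMT of a postal code (illustrative helper)."""
    pc = postal_code.replace(" ", "").upper()
    code = (dmt or "").strip().upper()
    # A "0" in the second position marks a rural postal code.
    if not code and len(pc) >= 2 and pc[1] == "0":
        code = "W"
    return {
        "dmt": code or None,
        "label": DMT_LABELS.get(code, "unknown"),
        "possibly_non_residential": code in NON_RESIDENTIAL,
    }
```

For example, describe_dmt("B0J 1A0") is labelled as a rural postal code, while describe_dmt("K1A0B1", "E") is flagged as possibly non-residential. Both postal codes are made up for the example.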

Population centre and rural area types are discrete geographic units that characterize the urban–rural continuum using a combination of population size, contiguity and density.Note 20 Core populations are defined as at least 50,000 people (CMA) or 10,000 people (census agglomeration [CA]). A secondary core is similar to a core in that it comprises at least 10,000 people, but it differs in that it was once the core of a CA that later merged with a growing CMA. Fringe areas include population centres within a CMA or CA with fewer than 10,000 people. Otherwise, the area is classified as rural.Note 20 CMAs are defined as having a total population of at least 100,000 people, of which 50,000 people must live in the core, while CAs must have a total population of at least 10,000 people.Note 20 Because of sample sizes, some analyses combined rural areas inside CMAs or CAs with rural areas outside CMAs or CAs into “rural areas,” and population centres outside CMAs and CAs with secondary cores as “secondary population centres.”
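The population thresholds quoted above can be written as a short rule. The sketch below considers only those thresholds and ignores every other delineation criterion, so it should be read as an aid to the definitions and not as the official classification; the function name classify_area is hypothetical.

```python
def classify_area(total_population, core_population):
    """Classify an area from the population thresholds quoted in the text.

    Only the thresholds are applied here; any other delineation rules are
    deliberately left out of this simplified sketch.
    """
    if core_population > total_population:
        raise ValueError("core population cannot exceed total population")
    # CMA: at least 100,000 people in total, of whom 50,000 live in the core.
    if total_population >= 100_000 and core_population >= 50_000:
        return "CMA"
    # CA: a core of at least 10,000 people (the total is then also >= 10,000).
    if core_population >= 10_000:
        return "CA"
    return "outside CMA/CA"
```

With these rules, an area of 120,000 people with a core of 60,000 is classified as a CMA, while the same area with a core of 40,000 is classified as a CA.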

Since postal codes can be assigned to more than one DA, the number of possible matches between each postal code and DA (i.e., records in the PCCF) was calculated separately using the original PCCF. For example, if a postal code was matched to three possible DAs in the PCCF, a count of three was recorded. These counts were merged with the 2016 Census data to obtain population counts for each postal code–DA match. The percentage of the population with postal codes that link to one DA was used as an indicator of greater geocoding accuracy. Population counts and the percentages of the population linked to one or multiple DAs were calculated by province or territory, DMT, and population centre or rural area type. The percentage of the population with a postal code linked to a single CSD, CD or CT was also calculated for comparison. CDs are a subdivision of provinces and territories that are a set of neighbouring municipalities, often with common services.Note 20 CSDs are a subdivision of CDs and generally representative of single municipalities or areas treated as municipal equivalents for statistical purposes (e.g., Indian reserves).Note 20 CTs are only located within CMAs or CAs and are small, stable areas, usually with a population of fewer than 10,000 people.Note 20
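The counting approach described above can be sketched as follows. The link records, the population counts and the function names are hypothetical, and the grouping of five or more matches into one category mirrors the way results are reported in this article.

```python
from collections import defaultdict


def da_match_counts(links):
    """Count distinct DAs per postal code from (postal_code, da) records."""
    das_by_pc = defaultdict(set)
    for postal_code, da in links:
        das_by_pc[postal_code].add(da)
    return {pc: len(das) for pc, das in das_by_pc.items()}


def population_share_by_matches(counts, population):
    """Percentage of the population by number of DA matches (1, 2, ..., 5+)."""
    totals = defaultdict(int)
    for postal_code, pop in population.items():
        n = counts.get(postal_code)
        if n is None:  # postal code absent from the link file
            continue
        totals["5+" if n >= 5 else str(n)] += pop
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {k: 100.0 * v / grand_total for k, v in totals.items()}


# Hypothetical link records and census population counts.
links = [("P1", "D1"),
         ("P2", "D1"), ("P2", "D2"),
         ("P3", "D1"), ("P3", "D2"), ("P3", "D3"), ("P3", "D4"), ("P3", "D5")]
population = {"P1": 700, "P2": 200, "P3": 100}

counts = da_match_counts(links)
shares = population_share_by_matches(counts, population)
print(shares)  # {'1': 70.0, '2': 20.0, '5+': 10.0}
```

In this toy example, 70% of the population has a postal code that links to a single DA, which is the quantity used here as the indicator of greater geocoding accuracy.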

Results

The majority of Canadians were serviced by a DMT of type A (urban street address, 70.6%), B (urban building address, 9.8%) or rural postal code (16.7%); however, these proportions varied among the provinces and territories (Table 1). In Quebec, Ontario, Alberta and British Columbia, proportions were similar to national estimates. However, rural postal codes were more common in the Atlantic provinces (except New Brunswick, where rural postal codes have been discontinued by Canada Post Corporation), Manitoba, Saskatchewan and the three territories. DMT also varied substantially among population centre and rural area types. Urban street address (DMT=A) was the most common in a CMA or CA core or fringe, as well as in a secondary core. However, in population centres and rural areas outside a CMA or CA, rural postal codes were the most common.

In Canada, 72.6% of the population’s postal codes were matched to a single DA, and 9.8% were matched to two DAs (Table 2). However, these proportions varied by province and territory, DMT, and population centre or rural area type. Among provinces and territories, similarly high proportions of the population (68.0% to 79.5%) were matched to a single DA in Quebec, Ontario, Manitoba, Saskatchewan, Alberta and British Columbia. However, in Prince Edward Island, 48.0% of the population was matched to five or more DAs, and in Nunavut, 71.9% of the population was matched to two or three DAs. By DMT, individuals with urban street addresses (DMT=A) and those residing in apartment buildings (DMT=B) were matched to a single DA in 85.3% and 95.3% of cases, respectively. However, those with rural postal codes were most commonly matched to five or more DAs (62.9%), and rarely matched to a single DA (13.9%). By population centre or rural area type, individuals living in a CMA or CA core or in a secondary core were matched to a single DA the majority of the time (from 74.9% to 87.1%). In contrast, individuals living in rural areas and population centres outside CMAs and CAs were more commonly matched to multiple DAs.

Table 3 presents the proportion of people matched to a single DA by combined characteristics. In general, individuals with urban street addresses (DMT=A) were matched to a single DA most of the time (72.6% to 98.4%), except in fringe areas of New Brunswick and Manitoba, secondary population centres in Saskatchewan and Alberta, and rural areas of the Prairies and British Columbia. People living in apartment buildings (DMT=B) had the highest accuracy (matching to a single DA 84.0% to 100.0% of the time), except in rural Alberta and secondary population centres in Manitoba. Individuals with rural postal codes were generally less frequently matched to a single DA, but results varied by province and territory as well as by population size. The proportions of people matched to a single DA varied substantially for all of the other DMTs examined. Note that not all provinces and territories have all DMTs and population centre or rural area types.

Table 4 provides population estimates, the proportion of the population matched to the three most common DMTs and the proportion of the population matched to a single DA, for all CMAs in Canada. For most CMAs, the vast majority of the population was linked to urban street addresses (DMT=A) and to a single DA. In Ontario, the CMAs of Kingston, Peterborough and Greater Sudbury had a higher proportion of the population with rural postal codes (20.5% to 24.6%) and a lower overall proportion of the population linked to a single DA (61.9% to 70.2%). For Barrie, Ontario, only 52.5% of the population was linked to a single DA despite 80.7% of the population having an urban street address (DMT=A).

To place the previous results into context, the percentages of the population linked to a single CSD, CD or CT by different characteristics are provided in Table 5. In general, most of the population (89.3%) was matched to a single CSD, with a notably lower percentage in Prince Edward Island (47.5%) and Nova Scotia (75.9%), rural areas (51.1% to 68.3%), and population centres outside a CMA or CA (67.9%). The proportion of the population matched to a single CSD was extremely high for DMTs A, B, E, G and M (greater than 99%), but was considerably lower for DMTs of types H and rural (43.5% and 44.1%). Results for CDs followed patterns similar to those for CSDs; however, the percentage matched to a single CD was higher than the percentage matched to a single CSD. Almost 26.7 million Canadians were matched to a CT, of whom 92.1% were matched to a single CT record. The percentage of postal codes matched to a single CT was lower for DMTs of types H, T, rural and Z, as well as for all areas outside a CMA or CA.

Discussion

This study uses postal codes that match to a single DA as an indicator of greater accuracy in assigning DA-level neighbourhood variables. Overall, 72.6% of the population of Canada was matched to a single DA, and this proportion was higher (87.1%) when restricted to Canadians residing within the urban core of a CMA or CA. Individuals with a DMT of urban street address (DMT=A) or apartment building (DMT=B) were also most commonly linked to a single DA. These estimates were provided by province and territory and by specific CMA and CA to inform studies of provincial and regional health datasets. For larger units of census geography, the percentages of the population with a postal code that linked to a single unit were 89.3% for CSDs, 95.4% for CDs and 92.1% for CTs.

In general, the finding that geocoding in urban areas is more accurate is consistent with the literature. Several other studies have examined point accuracy in geocoding (i.e., latitude and longitude), finding that postal codes can provide relatively accurate point estimates in urban centres, but are less accurate in rural regions.Note 17Note 21Note 22Note 23 For example, a study of patients undergoing cardiac catheterization in Calgary found that postal code coordinates were within a city block of the residence 87.9% of the time.Note 21 In a previous study of the accuracy of geocoding latitude and longitude using the PCCF+, median distances between the representative latitude and longitude (i.e., in the PCCF+) and the true location of the residence were small (0.16 km to 0.33 km) in communities of at least 10,000 people, but quite large in rural areas (5.60 km).Note 17

In addition to urban and rural variables, users can consider the DMT variable to enhance their knowledge of postal code geocoding accuracy. In a previous study, postal codes with DMTs A and B had the shortest distances between the representative latitude and longitude and the true location of the residence, whereas rural postal codes had some of the longest distances.Note 17 Other DMTs had varying degrees of accuracy, but the current study indicates that these DMTs are used by about 2.8% of the Canadian population. Accuracy in this group is heterogeneous because it is dependent on the postal code route. For example, in a rural route (DMT=H), the route may begin at a post office and then follow a nonlinear path to a rural area, cutting across several DAs in the process. DMT is provided as an output variable by the PCCF+, and the program also flags some DMTs (i.e., postal codes associated with businesses) as possibly non-residential in the problem output file. The PCCF+ also pulls some of the DMTs that are more likely to link to multiple DAs and attaches contextual data to a less precise three-character postal code (FSA) to reflect this uncertainty.

The PCCF+ is particularly useful in assigning census geography (e.g., DAs) for studies that attach neighbourhood-level covariates to a health dataset. DAs are frequently used because they are the smallest geographic unit of publicly available census data from Statistics Canada. Additionally, most neighbourhood-level socioeconomic indicators are produced at the DA level, such as the Canadian Marginalization Index (CAN-Marg)Note 24 and the Quebec index of material and social deprivation.Note 25 In environmental health research, several environmental indicators are also produced at the DA level, such as the Canadian Active Living Environments Database (Can-ALE).Note 26Note 27 Most studies using these data focus on urban population centres, where DA-level geocoding is most accurate. In addition, many contextual covariates are spatially autocorrelated, meaning that adjacent neighbourhoods (i.e., DAs) might have similar values of covariates attributed to them. For example, ethnic concentration might be similar across adjacent DAs in a given urban centre. In these cases, if a postal code is randomly assigned to an adjacent DA, the neighbourhood estimate for ethnic concentration is still likely to be similar. Issues are therefore more likely to arise for studies that include more rural areas or that use a neighbourhood covariate that is particularly heterogeneous across the landscape.

For primarily urban health datasets (e.g., city-level cohorts), DA assignment may therefore be considered reasonably accurate. However, in national cohorts, or cohorts that are primarily rural, DA assignment may present a problem. In these cases, if a population dataset is sufficiently large and the research question requires that the overall population distribution be only generally representative, inaccurate DA assignment might be mitigated by the average assignment of DAs across the landscape based on population weighting. However, in cases where a research question requires that the DA correspond with specific members of a cohort, such as when comparing individual characteristics with neighbourhood characteristics, or where a cohort is smaller in size, then this inaccuracy in geocoding might present a serious limitation. In these cases, a recommended alternative approach might be to consolidate DAs to a larger unit, such as a CSD or CD, to improve accuracy in matching postal codes to census geography.
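Consolidation to a larger unit can be illustrated for CDs. The sketch below assumes the standard layout of the eight-digit DA identifier, in which the first two digits identify the province or territory and the next two identify the CD, so that the first four digits form the CD identifier. The DA codes and function names are invented for the example. A CSD cannot be derived from the DA identifier in this way and would require a separate lookup.

```python
def cd_from_da(dauid):
    """Derive the four-digit CD identifier from an eight-digit DA identifier.

    Assumes the layout: 2-digit province or territory, 2-digit CD,
    4-digit DA code.
    """
    if len(dauid) != 8 or not dauid.isdigit():
        raise ValueError(f"unexpected DA identifier: {dauid!r}")
    return dauid[:4]


def unique_units(dauids):
    """Count the distinct DAs and CDs among a postal code's candidate DAs."""
    return {
        "DA": len(set(dauids)),
        "CD": len({cd_from_da(d) for d in dauids}),
    }


# Hypothetical candidate DAs for one rural postal code.
print(unique_units(["12060101", "12060102", "12060155"]))  # {'DA': 3, 'CD': 1}
```

In this example the postal code is ambiguous at the DA level but links to a single CD, which is the situation in which consolidation removes the geocoding uncertainty at the cost of coarser neighbourhood measures.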

The PCCF+ can help users identify different sources of possible error in an input health dataset by providing some diagnostic functionality. Missing or invalid postal codes in the source file will be flagged if they are not present in the PCCF. In these cases, users can manually check the postal codes, address ranges and street spelling using the Canada Post Corporation Internet lookup tool, or verify these postal codes against other sources (e.g., with members of a particular community). In its output, the PCCF+ provides users with a quality indicator for PCCF records that were geocoded using the automated system.Note 16 The PCCF+ also flags probable non-residential postal codes, that is, DMTs E, G and M. Since the PCCF+ uses population weighting in rural areas to attach census geography, it might be preferable to use the PCCF SLI to assign geography to these flagged postal codes. Canada Post Corporation can also retire postal codes and reinstate (or “rebirth”) them later in a different area. Therefore, users are encouraged to verify records where the DMT=Z, indicating retired postal codes, and to understand that there may not be exact correspondence between the current location of a postal code and a historical one. This may be particularly relevant in longitudinal health studies. Historical DMTs are also provided as output and can be used to check the previous type of a postal code before it was retired and reinstated.

An additional consideration in using the PCCF+ to assign residential census geography is that the program relies on census population weighting to perform random allocation of census geography. As a result, the program assumes that the geographic distribution of the health dataset is similar overall to that of the Canadian population. However, this may not always be the case. For example, researchers conducting a health study of an on-reserve Indigenous population may wish to manually select PCCF records that match to a reserve rather than to an adjacent town or city if it is known that the study respondents live on the reserve itself. Similarly, researchers studying the elderly population may wish to select PCCF records that selectively include seniors’ residences. Therefore, using the population distribution of Canada to conduct the random allocation of census geography, particularly in rural areas, presents an important limitation of the PCCF+ program for specialized health datasets where the distribution may be different from that of the general population.

Limitations

This study depended on postal codes captured by the 2016 Census of Population. The postal code variable is derived from three possible sources: the original Statistics Canada address files (i.e., the postal code used in the census mail-out), postal codes reported by respondents and postal codes mentioned on census form 3 (the personal questionnaire). Valid postal codes obtained from the address files used in mail-outs were used preferentially. In rare cases where a valid postal code was not obtained from any of the three sources, it was imputed using a donor postal code. Another source of error is that respondents may not necessarily have reported the postal code at their place of residence, and instead may have reported a business postal code or post office box.Note 20 The extent to which this occurs in the census dataset is not presently known.

Matches between postal codes and DAs were derived from the PCCF, which is regularly updated with all new postal codes introduced by Canada Post Corporation.Note 2 Many of these records are manually edited and verified by Statistics Canada as they are added. However, any positional errors in the original PCCF file that have not been corrected during manual edits would also translate to errors in the PCCF+ processing, and therefore affect DMT assignment and the DA–postal code counts.

Conclusions

This study highlights the importance of considering several geographic variables, including DMT, when determining the accuracy of matching postal codes to census geography. In general, the majority of Canadians are urban residents (69.4%) or have DMTs of type A or B (80.4%), most of whom were linked to a single DA (85.3% to 95.3%). Therefore, DA-level neighbourhood covariates can be relatively accurately assigned with postal codes in studies of urban Canada. Matching is less accurate in a handful of specific CMAs, such as Barrie, Ontario.

However, in rural areas and areas with rural postal codes, where postal codes likely straddle the boundaries of several DAs, postal codes were most frequently matched to multiple DAs. In these cases, the PCCF+ uses a random allocation algorithm to select a DA and, at a population level, assign the DA in a way that replicates the distribution of the overall population. However, for smaller (i.e., less representative) health datasets, or for datasets where it might be important to accurately match the neighbourhood to individual covariates, postal codes may not provide enough geographic detail for geocoding at the DA level. In these cases, either supplemental geographic information (e.g., address files) or consolidation to a larger census geographic unit (e.g., CTs) might help to mitigate geocoding error.

References