3 Setting up the data

We apply a standard set of exclusion criteria to all surveillance analyses. In this case, we actually delete the excluded records from the dataset, since technically they are not part of the populations we sampled.

We remove observations that satisfy any of the following criteria:

1. CENTER is not equal to "F", "J", "M", or "W"
2. RACE is not equal to "B" or "W"
3. CENTER is equal to "M" or "W" and RACE is not equal to "W"
4. AGE2 is outside the age range for your analysis (e.g., for CHD analysis of 1987 - 2014, we would exclude participants younger than 35 or older than 74)
5. GENDER is not equal to "F" or "M"
6. SAMWT is less than 0
7. EVTDAT4 is less than 0
8. year of EVTDAT4 is less than 1987 or more than 2014
9. CODESTRAT is missing

Conditions 2 and 3 remove ~4,200 observations, while the other conditions remove a very small number.
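
As a rough sketch, the exclusions above could be applied in R with dplyr along the following lines. The helper `apply_exclusions()`, the data frame `toy_events`, and its rows are hypothetical, and the code assumes EVTDAT4 is stored as a SAS date (days since 1960-01-01); adjust the date conversion if your extract stores it differently.

```r
library(dplyr)

# Hypothetical helper: keeps only records that pass every criterion.
# ASSUMPTION: EVTDAT4 is a SAS date, i.e. days since 1960-01-01.
apply_exclusions <- function(events, min_age = 35, max_age = 74,
                             min_year = 1987, max_year = 2014) {
  events %>%
    mutate(
      event_year = as.integer(
        format(as.Date(EVTDAT4, origin = "1960-01-01"), "%Y")
      )
    ) %>%
    filter(
      CENTER %in% c("F", "J", "M", "W"),
      RACE %in% c("B", "W"),
      !(CENTER %in% c("M", "W") & RACE != "W"),
      AGE2 >= min_age, AGE2 <= max_age,
      GENDER %in% c("F", "M"),
      SAMWT >= 0,
      EVTDAT4 >= 0,
      event_year >= min_year, event_year <= max_year,
      !is.na(CODESTRAT)
    ) %>%
    select(-event_year)
}

# Toy records (invented): only the first one passes every criterion.
toy_events <- data.frame(
  CENTER    = c("F", "M", "W", "X"),
  RACE      = c("B", "B", "W", "W"),
  AGE2      = c(40, 50, 80, 60),
  GENDER    = c("F", "M", "F", "M"),
  SAMWT     = c(1, 1, 1, 1),
  EVTDAT4   = c(10000, 10000, 10000, 10000),  # a date in 1987
  CODESTRAT = c("a", "a", "a", "a"),
  stringsAsFactors = FALSE
)
kept <- apply_exclusions(toy_events)
```

Note that `filter()` also drops rows where a condition evaluates to NA, so records with a missing AGE2, SAMWT, or EVTDAT4 are removed as well.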

Next, we group the AGE2 variable into five-year age groups as follows (SAS code):

*** AGEGRP  *** ;
        if      35 <= age2 <= 39 then do;agegrp = '35-39';end;
        else if 40 <= age2 <= 44 then do;agegrp = '40-44';end;
        else if 45 <= age2 <= 49 then do;agegrp = '45-49';end;
        else if 50 <= age2 <= 54 then do;agegrp = '50-54';end;
        else if 55 <= age2 <= 59 then do;agegrp = '55-59';end;
        else if 60 <= age2 <= 64 then do;agegrp = '60-64';end;
        else if 65 <= age2 <= 69 then do;agegrp = '65-69';end;
        else if 70 <= age2 <= 74 then do;agegrp = '70-74';end;
        else agegrp = ' ';
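
Because the later steps in this section are written in R, an equivalent grouping in R may be convenient. This is only a sketch, not the CSCC code: `make_agegrp()` is a hypothetical helper, it assumes AGE2 is recorded in whole years, and it returns NA (rather than a blank string) for ages outside 35-74.

```r
# Hypothetical R counterpart of the SAS AGEGRP step.
# ASSUMPTION: age2 holds whole years, so [35, 40) is the same as 35-39.
make_agegrp <- function(age2) {
  lower  <- seq(35, 70, by = 5)             # 35, 40, ..., 70
  labels <- paste0(lower, "-", lower + 4)   # "35-39", ..., "70-74"
  as.character(
    cut(age2, breaks = c(lower, 75), labels = labels, right = FALSE)
  )
}

agegrp_example <- make_agegrp(c(34, 35, 39, 40, 74, 75))
```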

3.1 Analyses that do not estimate rates

For analyses that are not estimating rates of events (i.e., do not use the population living in the ARIC catchment areas as the denominator), no unusual steps are needed to set up the dataset. As long as you use statistical models that account for the complex survey design, no further adjustment is required.

3.2 Analyses that estimate rates

When estimating rates of events in the ARIC catchment areas, the setup is more involved, because the population at risk must be used as the denominator. At CSCC, we apply the sampling strategy to the population counts so that when we calculate crude rates, or adjusted rates using models, the appropriate denominator is already taken into account. In Poisson models for rates, this denominator is called the offset of the model; with the usual log link, it enters the model on the log scale.
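
To illustrate what an offset does, here is a minimal sketch with invented numbers. It uses a plain `glm()` fit, which ignores the complex survey design, so it is not the analysis model itself; the data frame `toy_rates` and its columns are hypothetical.

```r
# Toy aggregated data (invented): event counts and population at risk.
toy_rates <- data.frame(
  group      = c("A", "B"),
  events     = c(10, 30),
  population = c(1000, 1500)
)

# The offset enters as log(population), so the linear predictor models
# log(expected events / population), i.e. the log rate.
fit <- glm(events ~ group + offset(log(population)),
           family = poisson(link = "log"), data = toy_rates)

rate_A     <- exp(coef(fit)[["(Intercept)"]])  # estimated rate in group A
rate_ratio <- exp(coef(fit)[["groupB"]])       # rate in B relative to A
```

Because this toy model has one parameter per group, the estimates reproduce the crude rates: 10/1000 in group A, and a rate ratio of (30/1500) / (10/1000).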

3.2.1 Calculating the offset

Suppose you are analyzing the rate of MI for each year, adjusting for age, race, and sex. You then need the population counts for each combination of year, age, race, and sex.

The aricpop6 file contains this information:

## # A tibble: 176 x 37
##    CENTER RACE  GENDER AGEGRP P2000 P2010 P1990 P1991 P1992 P1993 P1994 P1995 P1996 P1997 P1998 P1999 P1987
##    <chr>  <chr> <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 F      B     F      35-39   3541  3508  3055  3105  3155  3205  3255  3305  3355  3405  3455  3505  2836
##  2 F      B     F      40-44   3416  3552  2509  2602  2695  2787  2880  2973  3065  3158  3251  3343  1989
##  3 F      B     F      45-49   2977  3737  1743  1868  1992  2117  2242  2366  2491  2615  2740  2865  1538
##  4 F      B     F      50-54   2534  3558  1480  1586  1691  1797  1902  2007  2113  2218  2324  2429  1398
##  5 F      B     F      55-59   1690  3087  1312  1347  1382  1417  1453  1488  1523  1558  1593  1629  1303
##  6 F      B     F      60-64   1388  2530  1223  1237  1251  1265  1279  1293  1307  1321  1335  1349  1219
##  7 F      B     F      65-69   1218  1617  1171  1175  1179  1182  1186  1190  1194  1198  1201  1205  1141
##  8 F      B     F      70-74   1034  1176   977   983   988   994   999  1004  1010  1015  1021  1026  1041
##  9 F      B     F      75-79    822   922   839   837   835   833   831   829   827   825   823   821    NA
## 10 F      B     F      80-84    570   687   519   524   529   534   539   544   549   554   558   563    NA
## # … with 166 more rows, and 20 more variables: P1988 <dbl>, P1989 <dbl>, P2001 <dbl>, P2002 <dbl>, P2003 <dbl>,
## #   P2004 <dbl>, P2005 <dbl>, P2006 <dbl>, P2007 <dbl>, P2008 <dbl>, P2009 <dbl>, P2011 <dbl>, P2012 <dbl>,
## #   P2013 <dbl>, P2014 <dbl>, P2015 <dbl>, P2016 <dbl>, P2017 <dbl>, P2018 <dbl>, P2019 <dbl>

However, this file is more useful once it is reshaped from wide to “long” format, with one row per year:

pop_long <- pop %>%
  group_by(CENTER, RACE, GENDER, AGEGRP) %>%
  gather(year, population_count, starts_with("P")) %>%
  mutate(YEAR = str_sub(year, start = 2)) %>%
  select(-year)
pop_long

## # A tibble: 5,808 x 6
## # Groups:   CENTER, RACE, GENDER, AGEGRP [176]
##    CENTER RACE  GENDER AGEGRP population_count YEAR 
##    <chr>  <chr> <chr>  <chr>             <dbl> <chr>
##  1 F      B     F      35-39              3541 2000 
##  2 F      B     F      40-44              3416 2000 
##  3 F      B     F      45-49              2977 2000 
##  4 F      B     F      50-54              2534 2000 
##  5 F      B     F      55-59              1690 2000 
##  6 F      B     F      60-64              1388 2000 
##  7 F      B     F      65-69              1218 2000 
##  8 F      B     F      70-74              1034 2000 
##  9 F      B     F      75-79               822 2000 
## 10 F      B     F      80-84               570 2000 
## # … with 5,798 more rows
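
In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`. A sketch of the same reshaping is given below, using two rows and two year columns copied from the printout above; the names `pop_toy` and `pop_long_toy` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Two rows and two year columns taken from the aricpop6 printout.
pop_toy <- data.frame(
  CENTER = c("F", "F"),
  RACE   = c("B", "B"),
  GENDER = c("F", "F"),
  AGEGRP = c("35-39", "40-44"),
  P1987  = c(2836, 1989),
  P1990  = c(3055, 2509),
  stringsAsFactors = FALSE
)

# names_prefix strips the leading "P", so YEAR holds "1987", "1990", ...
pop_long_toy <- pop_toy %>%
  pivot_longer(
    cols         = starts_with("P"),
    names_to     = "YEAR",
    names_prefix = "P",
    values_to    = "population_count"
  )
```

Unlike the `gather()` version, the result is ungrouped and its rows are ordered by the original row rather than by year; neither difference affects the summaries that follow.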

Because this analysis does not adjust for center, we need the population count within each year/race/gender/agegrp combination, summed across all centers.

denominator_counts <- pop_long %>%
  ungroup() %>%
  group_by(YEAR, RACE, GENDER, AGEGRP) %>%
  summarise(
    denominator = sum(population_count)
  ) %>%
  filter(!(AGEGRP %in% c("75-79", "80-84", "85+")))  # Drop the age groups above the 35-74 analysis range
denominator_counts

## # A tibble: 1,056 x 5
## # Groups:   YEAR, RACE, GENDER [132]
##    YEAR  RACE  GENDER AGEGRP denominator
##    <chr> <chr> <chr>  <chr>        <dbl>
##  1 1987  B     F      35-39         7501
##  2 1987  B     F      40-44         5121
##  3 1987  B     F      45-49         4006
##  4 1987  B     F      50-54         3515
##  5 1987  B     F      55-59         3224
##  6 1987  B     F      60-64         2971
##  7 1987  B     F      65-69         2711
##  8 1987  B     F      70-74         2328
##  9 1987  B     M      35-39         6594
## 10 1987  B     M      40-44         4465
## # … with 1,046 more rows

Now we need to apply the sampling design to these denominators. First, we perform the same tally in our sample from the s14evt1 dataset¹:

sample_counts <- s14 %>%
  count(YEAR, RACE, AGEGRP, GENDER) %>%
  rename(sample_count = n)
sample_counts

## # A tibble: 896 x 5
##    YEAR  RACE  AGEGRP GENDER sample_count
##    <chr> <chr> <chr>  <chr>         <int>
##  1 1987  B     35-39  F                 5
##  2 1987  B     35-39  M                10
##  3 1987  B     40-44  F                12
##  4 1987  B     40-44  M                11
##  5 1987  B     45-49  F                15
##  6 1987  B     45-49  M                24
##  7 1987  B     50-54  F                26
##  8 1987  B     50-54  M                27
##  9 1987  B     55-59  F                35
## 10 1987  B     55-59  M                40
## # … with 886 more rows

new_weight <- inner_join(denominator_counts, sample_counts) %>%
  mutate(new_weight = denominator / sample_count)

## Warning: Column `RACE` has different attributes on LHS and RHS of join

## Warning: Column `GENDER` has different attributes on LHS and RHS of join

new_weight

## # A tibble: 896 x 7
## # Groups:   YEAR, RACE, GENDER [112]
##    YEAR  RACE  GENDER AGEGRP denominator sample_count new_weight
##    <chr> <chr> <chr>  <chr>        <dbl>        <int>      <dbl>
##  1 1987  B     F      35-39         7501            5     1500. 
##  2 1987  B     F      40-44         5121           12      427. 
##  3 1987  B     F      45-49         4006           15      267. 
##  4 1987  B     F      50-54         3515           26      135. 
##  5 1987  B     F      55-59         3224           35       92.1
##  6 1987  B     F      60-64         2971           46       64.6
##  7 1987  B     F      65-69         2711           55       49.3
##  8 1987  B     F      70-74         2328           66       35.3
##  9 1987  B     M      35-39         6594           10      659. 
## 10 1987  B     M      40-44         4465           11      406. 
## # … with 886 more rows

Finally, the proper offset, or denominator of the rates for your model, is obtained by dividing the new_weight variable by your sampling weight variable, which is usually SAMWT_TRIM.

## Warning: Column `GENDER` has different attributes on LHS and RHS of join

## Warning: Column `RACE` has different attributes on LHS and RHS of join

## # A tibble: 93,985 x 7
##    GENDER RACE  AGEGRP YEAR  new_weight SAMWT_TRIM offset
##    <chr>  <chr> <chr>  <chr>      <dbl>      <dbl>  <dbl>
##  1 M      W     60-64  1987        30.2          1   30.2
##  2 M      W     70-74  1987        19.3          1   19.3
##  3 F      W     70-74  1987        33.0          1   33.0
##  4 F      W     65-69  1987        50.0          1   50.0
##  5 M      W     65-69  1987        24.1          1   24.1
##  6 F      W     70-74  1987        33.0          1   33.0
##  7 F      W     70-74  1987        33.0          1   33.0
##  8 M      W     60-64  1987        30.2          1   30.2
##  9 M      B     70-74  1987        32.2          1   32.2
## 10 F      W     65-69  1987        50.0          1   50.0
## # … with 93,975 more rows
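
The code for this last step is not shown above. One way it might be written is sketched here. The tables `s14_toy` and `new_weight_toy` are hypothetical stand-ins for s14 and new_weight, and the SAMWT_TRIM value of 2 is invented to make the division visible.

```r
library(dplyr)

# Stand-in for new_weight: one stratum taken from the printout above.
new_weight_toy <- data.frame(
  YEAR = "1987", RACE = "W", GENDER = "M", AGEGRP = "60-64",
  new_weight = 30.2,
  stringsAsFactors = FALSE
)

# Stand-in for the event-level sample: two records in that stratum.
s14_toy <- data.frame(
  YEAR       = c("1987", "1987"),
  RACE       = c("W", "W"),
  GENDER     = c("M", "M"),
  AGEGRP     = c("60-64", "60-64"),
  SAMWT_TRIM = c(1, 2),   # the value 2 is invented for illustration
  stringsAsFactors = FALSE
)

# Attach each record's stratum weight, then divide by its sampling weight.
offsets_toy <- s14_toy %>%
  left_join(new_weight_toy, by = c("YEAR", "RACE", "GENDER", "AGEGRP")) %>%
  mutate(offset = new_weight / SAMWT_TRIM) %>%
  select(GENDER, RACE, AGEGRP, YEAR, new_weight, SAMWT_TRIM, offset)
```

Naming the join keys in `by` also documents which columns define a stratum.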

¹ Some data wrangling must be done first: creating the AGEGRP variable, renaming the race and gender variables, and excluding records that did not meet the age range criteria, were neither white nor black, or were black but at the Washington County or Minnesota sites.