3 Setting up the data
There is a standard set of exclusion criteria we apply to all surveillance analyses. In this particular case, we do delete these records from the dataset, since they technically are not part of the populations we were sampling.
We remove observations that satisfy any of the following criteria:
1. CENTER is not equal to "F", "J", "M", or "W"
2. RACE is not equal to "B" or "W"
3. CENTER is equal to "M" or "W" and RACE is not equal to "W"
4. AGE2 is not in the age range for your analysis (e.g., for the CHD analysis of 1987 - 2014, we would exclude participants younger than 35 or older than 74)
5. GENDER is not equal to "F" or "M"
6. SAMWT is less than 0
7. EVTDAT4 is less than 0
8. The year of EVTDAT4 is less than 1987 or more than 2014
9. CODESTRAT is missing
Conditions 2 and 3 remove ~4,200 observations, while the other conditions remove a very small number.
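As a rough illustration, the criteria above could be written as a single dplyr filter that keeps only the records passing every check. This is a sketch rather than the CSCC production code: the toy `events` tibble is hypothetical, the 35 to 74 age range is the CHD example from the list, and `EVTDAT4` is assumed to have been read in as an R `Date` (a SAS date below 0 is a date before 1 January 1960).

```r
library(dplyr)

# Hypothetical toy records; real data come from the surveillance event file.
events <- tibble(
  CENTER    = c("F", "M", "W", "J", "X"),
  RACE      = c("B", "B", "W", "A", "W"),
  AGE2      = c(40, 50, 60, 45, 55),
  GENDER    = c("F", "M", "F", "M", "F"),
  SAMWT     = c(1, 1, 2, 1, 1),
  EVTDAT4   = as.Date(c("1990-05-01", "1995-03-02", "2001-07-04",
                        "1999-01-01", "2005-02-02")),
  CODESTRAT = c(1, 2, 3, 1, 2)
)

# Each line is the negation of one exclusion criterion.
# filter() also drops rows where a condition evaluates to NA.
kept <- events %>%
  filter(
    CENTER %in% c("F", "J", "M", "W"),
    RACE %in% c("B", "W"),
    !(CENTER %in% c("M", "W") & RACE != "W"),
    AGE2 >= 35, AGE2 <= 74,                  # assumed analysis age range
    GENDER %in% c("F", "M"),
    SAMWT >= 0,
    EVTDAT4 >= as.Date("1960-01-01"),        # SAS date value of 0
    as.integer(format(EVTDAT4, "%Y")) >= 1987,
    as.integer(format(EVTDAT4, "%Y")) <= 2014,
    !is.na(CODESTRAT)
  )

kept
```

In the toy data, only the first and third records survive: the second fails criterion 3, the fourth fails criterion 2, and the fifth fails criterion 1.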
Second, we define a grouping of the AGE2 variable as follows (SAS code):
*** AGEGRP *** ;
if 35 <= age2 <= 39 then do;agegrp = '35-39';end;
else if 40 <= age2 <= 44 then do;agegrp = '40-44';end;
else if 45 <= age2 <= 49 then do;agegrp = '45-49';end;
else if 50 <= age2 <= 54 then do;agegrp = '50-54';end;
else if 55 <= age2 <= 59 then do;agegrp = '55-59';end;
else if 60 <= age2 <= 64 then do;agegrp = '60-64';end;
else if 65 <= age2 <= 69 then do;agegrp = '65-69';end;
else if 70 <= age2 <= 74 then do;agegrp = '70-74';end;
else agegrp = ' ';
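Because the rest of this chapter works in R, a roughly equivalent grouping in R is sketched here. The helper name `age_group()` is made up for this example, ages are assumed to be whole numbers, and out-of-range ages come back as `NA` rather than the blank string used in the SAS code.

```r
# Hypothetical helper: map integer ages to the same 5-year labels as the SAS code.
age_group <- function(age2) {
  lower  <- seq(35, 70, by = 5)                 # lower bound of each group
  labels <- paste(lower, lower + 4, sep = "-")  # "35-39", ..., "70-74"
  # right = FALSE makes the intervals [35, 40), [40, 45), ..., [70, 75)
  as.character(cut(age2, breaks = c(lower, 75), labels = labels, right = FALSE))
}

age_group(c(34, 35, 39, 40, 74, 75))
```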
3.1 Analyses that do not estimate rates
For analyses that are not estimating rates of events (i.e., do not use the population living in the ARIC catchment areas as the denominator), there are no unusual steps needed to set up the dataset. As long as you use statistical models that account for the complex survey design, no further preparation is required.
3.2 Analyses that estimate rates
When estimating rates of events in the ARIC catchment areas, things get a little more complicated, because the population at risk must be used as the denominator. At CSCC, we apply the sampling strategy to the population counts so that when we calculate crude rates, or adjusted rates using models, the appropriate denominator is already taken into account. In Poisson models for rates, this denominator is called the offset of the model.
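To make the role of the offset concrete: a Poisson rate model has the form log(E[events]) = log(population) + Xβ, so the log of the population at risk enters with its coefficient fixed at 1. The sketch below uses made-up counts (not ARIC data) and base R's `glm()` to show that an intercept-only model with such an offset recovers the crude rate.

```r
# Made-up event counts and populations at risk for three strata (not ARIC data).
toy <- data.frame(
  events     = c(12, 30, 45),
  population = c(4000, 5000, 6000)
)

# offset(log(population)) fixes the coefficient of log(population) at 1,
# so the model describes events per person rather than raw counts.
fit <- glm(events ~ 1 + offset(log(population)), family = poisson, data = toy)

crude_rate <- sum(toy$events) / sum(toy$population)  # total events / total population
model_rate <- unname(exp(coef(fit)))                 # back-transformed intercept

c(crude = crude_rate, model = model_rate)
```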
3.2.1 Calculating the offset
Let’s say that you are analyzing the rate of MI for each year, adjusting for age, race, and sex. Therefore, you need to know the population counts for each combination of year, age, race, and sex.
The aricpop6 file contains this information:
## # A tibble: 176 x 37
## CENTER RACE GENDER AGEGRP P2000 P2010 P1990 P1991 P1992 P1993 P1994 P1995 P1996 P1997 P1998 P1999 P1987
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F B F 35-39 3541 3508 3055 3105 3155 3205 3255 3305 3355 3405 3455 3505 2836
## 2 F B F 40-44 3416 3552 2509 2602 2695 2787 2880 2973 3065 3158 3251 3343 1989
## 3 F B F 45-49 2977 3737 1743 1868 1992 2117 2242 2366 2491 2615 2740 2865 1538
## 4 F B F 50-54 2534 3558 1480 1586 1691 1797 1902 2007 2113 2218 2324 2429 1398
## 5 F B F 55-59 1690 3087 1312 1347 1382 1417 1453 1488 1523 1558 1593 1629 1303
## 6 F B F 60-64 1388 2530 1223 1237 1251 1265 1279 1293 1307 1321 1335 1349 1219
## 7 F B F 65-69 1218 1617 1171 1175 1179 1182 1186 1190 1194 1198 1201 1205 1141
## 8 F B F 70-74 1034 1176 977 983 988 994 999 1004 1010 1015 1021 1026 1041
## 9 F B F 75-79 822 922 839 837 835 833 831 829 827 825 823 821 NA
## 10 F B F 80-84 570 687 519 524 529 534 539 544 549 554 558 563 NA
## # … with 166 more rows, and 20 more variables: P1988 <dbl>, P1989 <dbl>, P2001 <dbl>, P2002 <dbl>, P2003 <dbl>,
## # P2004 <dbl>, P2005 <dbl>, P2006 <dbl>, P2007 <dbl>, P2008 <dbl>, P2009 <dbl>, P2011 <dbl>, P2012 <dbl>,
## # P2013 <dbl>, P2014 <dbl>, P2015 <dbl>, P2016 <dbl>, P2017 <dbl>, P2018 <dbl>, P2019 <dbl>
However, this file should be transposed into “long” format in order to be more useful:
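library(tidyverse)  # dplyr, tidyr and stringr are all used below

# `pop` holds the contents of the aricpop6 file shown above
pop_long <- pop %>%
  group_by(CENTER, RACE, GENDER, AGEGRP) %>%
  gather(year, population_count, starts_with("P")) %>%  # one row per stratum and year
  mutate(YEAR = str_sub(year, start = 2)) %>%           # "P2000" -> "2000"
  select(-year)
pop_long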
## # A tibble: 5,808 x 6
## # Groups: CENTER, RACE, GENDER, AGEGRP [176]
## CENTER RACE GENDER AGEGRP population_count YEAR
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 F B F 35-39 3541 2000
## 2 F B F 40-44 3416 2000
## 3 F B F 45-49 2977 2000
## 4 F B F 50-54 2534 2000
## 5 F B F 55-59 1690 2000
## 6 F B F 60-64 1388 2000
## 7 F B F 65-69 1218 2000
## 8 F B F 70-74 1034 2000
## 9 F B F 75-79 822 2000
## 10 F B F 80-84 570 2000
## # … with 5,798 more rows
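In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. A minimal sketch of the same reshaping follows, using a two-row, two-year extract of the table above; the object names `pop_toy` and `pop_long_toy` are made up for this example.

```r
library(dplyr)
library(tidyr)

# Two strata and two year columns copied from the aricpop6 printout above.
pop_toy <- tibble(
  CENTER = c("F", "F"),
  RACE   = c("B", "B"),
  GENDER = c("F", "F"),
  AGEGRP = c("35-39", "40-44"),
  P1987  = c(2836, 1989),
  P2000  = c(3541, 3416)
)

pop_long_toy <- pop_toy %>%
  pivot_longer(
    cols         = starts_with("P"),   # the P<year> columns
    names_to     = "YEAR",
    names_prefix = "P",                # strip the leading "P" from the year
    values_to    = "population_count"
  )

pop_long_toy
```

Unlike `gather()`, `pivot_longer()` orders the result stratum by stratum rather than year by year, so the row order differs even though the contents are the same.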
Since we are conducting an analysis that does not adjust for center, we need the count of the population within each year/race/gender/agegrp across all centers.
denominator_counts <- pop_long %>%
  ungroup() %>%
  group_by(YEAR, RACE, GENDER, AGEGRP) %>%
  summarise(
    denominator = sum(population_count)
  ) %>%
  # Remove the older age groups
  filter(!(AGEGRP %in% c("75-79", "80-84", "85+")))
denominator_counts
## # A tibble: 1,056 x 5
## # Groups: YEAR, RACE, GENDER [132]
## YEAR RACE GENDER AGEGRP denominator
## <chr> <chr> <chr> <chr> <dbl>
## 1 1987 B F 35-39 7501
## 2 1987 B F 40-44 5121
## 3 1987 B F 45-49 4006
## 4 1987 B F 50-54 3515
## 5 1987 B F 55-59 3224
## 6 1987 B F 60-64 2971
## 7 1987 B F 65-69 2711
## 8 1987 B F 70-74 2328
## 9 1987 B M 35-39 6594
## 10 1987 B M 40-44 4465
## # … with 1,046 more rows
Now we need to apply the sampling design to these denominators. First, we perform the same tally in our sample from the s14evt1 dataset (see footnote 1):
sample_counts <- s14 %>%
  count(YEAR, RACE, AGEGRP, GENDER) %>%
  rename(sample_count = n)
sample_counts
## # A tibble: 896 x 5
## YEAR RACE AGEGRP GENDER sample_count
## <chr> <chr> <chr> <chr> <int>
## 1 1987 B 35-39 F 5
## 2 1987 B 35-39 M 10
## 3 1987 B 40-44 F 12
## 4 1987 B 40-44 M 11
## 5 1987 B 45-49 F 15
## 6 1987 B 45-49 M 24
## 7 1987 B 50-54 F 26
## 8 1987 B 50-54 M 27
## 9 1987 B 55-59 F 35
## 10 1987 B 55-59 M 40
## # … with 886 more rows
new_weight <- inner_join(
  denominator_counts, sample_counts,
  by = c("YEAR", "RACE", "GENDER", "AGEGRP")  # join on the four stratum variables
) %>%
  mutate(new_weight = denominator / sample_count)
## Warning: Column `RACE` has different attributes on LHS and RHS of join
## Warning: Column `GENDER` has different attributes on LHS and RHS of join
new_weight
## # A tibble: 896 x 7
## # Groups: YEAR, RACE, GENDER [112]
## YEAR RACE GENDER AGEGRP denominator sample_count new_weight
## <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 1987 B F 35-39 7501 5 1500.
## 2 1987 B F 40-44 5121 12 427.
## 3 1987 B F 45-49 4006 15 267.
## 4 1987 B F 50-54 3515 26 135.
## 5 1987 B F 55-59 3224 35 92.1
## 6 1987 B F 60-64 2971 46 64.6
## 7 1987 B F 65-69 2711 55 49.3
## 8 1987 B F 70-74 2328 66 35.3
## 9 1987 B M 35-39 6594 10 659.
## 10 1987 B M 40-44 4465 11 406.
## # … with 886 more rows
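The weight is simply the population count in a stratum divided by the number of sampled events in that stratum. A small check using the first two rows shown above (the `_toy` object names are made up for this example):

```r
library(dplyr)

# First two strata from the denominator_counts and sample_counts printouts.
denominator_toy <- tibble(
  YEAR = "1987", RACE = "B", GENDER = "F",
  AGEGRP = c("35-39", "40-44"),
  denominator = c(7501, 5121)
)
sample_toy <- tibble(
  YEAR = "1987", RACE = "B", GENDER = "F",
  AGEGRP = c("35-39", "40-44"),
  sample_count = c(5L, 12L)
)

new_weight_toy <- inner_join(
  denominator_toy, sample_toy,
  by = c("YEAR", "RACE", "GENDER", "AGEGRP")
) %>%
  mutate(new_weight = denominator / sample_count)  # 7501 / 5 and 5121 / 12

new_weight_toy
```

The results, 1500.2 and 426.75, match the rounded values 1500. and 427. in the printed tibble.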
Finally, the proper offset, or denominator of the rates for your model, can be arrived at by dividing the new_weight variable by your weight variable, which is usually SAMWT_TRIM.
## Warning: Column `GENDER` has different attributes on LHS and RHS of join
## Warning: Column `RACE` has different attributes on LHS and RHS of join
## # A tibble: 93,985 x 7
## GENDER RACE AGEGRP YEAR new_weight SAMWT_TRIM offset
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 M W 60-64 1987 30.2 1 30.2
## 2 M W 70-74 1987 19.3 1 19.3
## 3 F W 70-74 1987 33.0 1 33.0
## 4 F W 65-69 1987 50.0 1 50.0
## 5 M W 65-69 1987 24.1 1 24.1
## 6 F W 70-74 1987 33.0 1 33.0
## 7 F W 70-74 1987 33.0 1 33.0
## 8 M W 60-64 1987 30.2 1 30.2
## 9 M B 70-74 1987 32.2 1 32.2
## 10 F W 65-69 1987 50.0 1 50.0
## # … with 93,975 more rows
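The code for this final step is not shown above. One way it could look is sketched here, assuming a record-level tibble with one row per sampled event that carries `SAMWT_TRIM` and the four stratum variables. The `_toy` objects and the `SAMWT_TRIM` value of 2 are hypothetical; the weights 30.2 and 19.3 are taken from the output above.

```r
library(dplyr)

# Stratum-level weights (values from the printout above).
weights_toy <- tibble(
  YEAR = "1987", RACE = "W", GENDER = "M",
  AGEGRP = c("60-64", "70-74"),
  new_weight = c(30.2, 19.3)
)

# Hypothetical event records; the third SAMWT_TRIM value is invented.
records_toy <- tibble(
  YEAR = "1987", RACE = "W", GENDER = "M",
  AGEGRP = c("60-64", "70-74", "60-64"),
  SAMWT_TRIM = c(1, 1, 2)
)

offsets_toy <- records_toy %>%
  left_join(weights_toy, by = c("YEAR", "RACE", "GENDER", "AGEGRP")) %>%
  mutate(offset = new_weight / SAMWT_TRIM)  # stratum weight / record weight

offsets_toy
```

In a Poisson model this variable would typically enter on the log scale, for example as `offset(log(offset))` in a model formula.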
1. Some data wrangling must be done first: create the AGEGRP variable, rename the race and gender variables, and exclude records that did not meet the age range criteria, were neither white nor black, or were black but at the Washington County or Minnesota sites.