Predicting Influenza-Like Illness in Kent County, Michigan

(Capstone research proposal)


Background: Local public health plays a pivotal role in control of emerging and re-emerging communicable diseases. Advancements in data-backed disease surveillance methods have been used to create forecasting models at broader scales; however, gaps remain in the literature about disease modeling in local public health. Techniques from existing research may enhance disease forecasting capacity in local public health departments.

Aims: This research aims to assess the impact of a county-specific predictive model of Influenza-Like Illness (ILI), incorporating predictor variables from heterogeneous online data streams, on ILI prediction in Kent County, Michigan, compared to historical trends alone.

Hypothesis: The model based on historical incidence, ED visits, local air quality, meteorological conditions, and Google search trends will be significantly more accurate than the model based on historical incidence alone.

Methods: Several Kent County ILI forecasting models will be built, progressing from models based on historical incidence alone to models based on incidence, ED visits, local air quality, meteorological conditions, and Google search trends. The predictive performance of the models will be compared.

Implications: This research aligns with CDC’s Center for Forecasting and Outbreak Analytics 2023-2028 Strategic Plan. The anticipated results may lead to improved community health outcomes in Kent County, Michigan, due to improved ILI prevention and response in the local health department. Moreover, the findings will support further development of disease forecasting and advanced analytics in local public health by offering methodological insights and lessons learned for capacity-building initiatives moving forward.


Background & Significance

Emerging and re-emerging communicable diseases represent major challenges for public health systems worldwide. Local public health departments play a crucial role in disease surveillance and control, not only during epidemic-related emergencies such as the recent COVID-19 pandemic, but also during non-epidemic times. One of the main functions of local public health is to inform and develop mitigation and prevention strategies for endemic infectious diseases through the use of routine surveillance. Timeliness is a key component of effective prevention because public health interventions can begin only once incidence has been identified in the community. Methods which quickly identify disease incidence, particularly disease forecasting methods, are especially useful tools in the local public health surveillance toolbox.

Advances in information technology and data-backed epidemiological modelling methods stand to strengthen disease forecasting capacity in local public health. Disease forecasting methods which combine diverse online data and traditional surveillance data are heavily discussed in the literature (Athanasiou, et al., 2023; Husnayain, et al., 2019; Kandula, et al., 2017; Mavragani & Ochoa, 2018; Soliman, et al., 2019), yet few manuscripts describe forecasting applications at finer spatial resolutions, such as local public health jurisdictions. Here we review Influenza-Like Illness (ILI) forecasting methods which involve the use of secondary data from multiple sources, and we describe gaps in the literature regarding their applicability in local public health.

Google Search Trends is a popular tool for fetching online information in healthcare research. Online search behaviors, as reflected in Trends data, often correlate with disease incidence or prevalence. For instance, Mavragani & Ochoa (2018b) used linear regression to describe significant associations between Google Trends data and national and state-level Chlamydia and Hepatitis incidence. Further associations were found between AIDS-related Google search terms and AIDS prevalence in the US (Mavragani & Ochoa, 2018a). In Husnayain et al. (2019), a moving average analysis was used to describe a lagged relationship between Google Trends data and nationally reported Dengue incidence in Indonesia. Likewise, Kandula et al. (2017) used machine learning to develop state-level and regional-level ILI nowcast models with Google search trends data.

Weather and pollution-related variables may also serve as useful early indicators of respiratory disease incidence. Ku et al. (2022) modeled daily ER visits for respiratory diseases based on climatic and air-pollution factors in urban Seoul between 2014 and 2019. Of the factors used in their model, atmospheric pressure, carbon monoxide, and exposure to inhalable particulate matter smaller than 2.5 micrometers (PM2.5) were positively associated with the number of respiratory disease patients, consistent with prior literature (Lu et al., 2020).

Wide variation exists in the methods and variables used for ILI forecasting. Improving model performance with machine learning was a common theme within the literature. Yang et al. (2023) leveraged climate and weather variables alongside internet search and social media data to develop a multivariate model of influenza incidence using machine learning techniques. Athanasiou et al. (2023) combined weather variables with Twitter data and deep learning to accurately predict influenza in Greece. Real-time hospital admissions data are commonly combined with other sources of online data for syndromic disease surveillance, such as in Poirier et al. (2018), where admissions data were examined alongside Google Trends data to accurately predict influenza incidence in France.

When communicable disease modeling results in multiple models with varying strengths and weaknesses, biostatisticians often use ensemble-based modeling methods to capitalize on the strengths of each individual model. Relatively few studies describe the application of this technique in local public health. One notable case study is Lu et al. (2018), where individual influenza forecasting models were built for the City of Boston using Google search trends, Twitter data, and insurance claims data. This ground-breaking approach combined the strengths of each individual model to successfully apply influenza-forecasting methods at the local scale. Soliman et al. (2019) reported a similar success, using an ensemble approach to fuse multiple Google Trends and meteorological data models to predict ILI in Dallas County, Texas. As Kandula et al. (2017) explain, ILI forecasting models built from local data predict incidence more accurately than models built with extrapolated state data. Local models are therefore more relevant and useful for local public health disease forecasting.

County-specific ILI forecasting models which exploit the vast quantities of health-related data online have the potential to drastically improve public health interventions. Indeed, timely detection of ILI spikes may facilitate faster communication with stakeholders and enhance community decision-making, ultimately lowering disease burden in the community. Moreover, methods successful in one locality might be adapted in other local public health jurisdictions. Further research and applied development in the realm of local public health forecasting and analytics capacity is undoubtedly warranted.

The existing literature demonstrates the power of utilizing disparate internet data in disease surveillance at the state level. This proposal aims to help fill the research gap in local applications by replicating many of the above-described techniques to build a predictive model of Influenza-Like Illness in Kent County as a case study.

Kent County is the fourth most populous Michigan county, with approximately 655,000 residents. The county has a vibrant manufacturing industry, with the City of Grand Rapids at its center. The ethnic makeup is White (78%), followed by Black (10%) and Multiracial (7%). During the 2018-2019 influenza season, 2,650 cases of influenza were reported to the Kent County Health Department. Of those cases, 54% were female and 45% were male. Additionally, 46% of influenza cases were in children under the age of 18 (United States Census Bureau, n.d.). At the time of this writing, the Kent County Health Department uses emergency department syndromic surveillance to monitor and assess ILI activity and transmission in the community. Given its population, industrial profile, and established public health practices, Kent County is an excellent candidate for a disease forecasting case study.

This proposed research will assess whether the above-described disease forecasting and analytics methods, applied to Kent County-specific data, will detect or forecast ILI activity significantly sooner than the county’s current methods. If successful, this case study will be used to improve public health prevention and mitigation in the county and to support further capacity-building initiatives in local public health analytics and disease forecasting.


Study Design

This is a descriptive, ecological study which uses multiple readily available secondary internet data streams, alongside more traditional surveillance data, to build a predictive model of ILI incidence in Kent County, Michigan.

Data Description & Collection

ILI case surveillance data are provided by Michigan's state-wide communicable disease reporting system, the Michigan Disease Surveillance System (MDSS). These data represent physician-diagnosed and/or laboratory-diagnosed ILI. The target variable in the model will be future ILI cases in Kent County; historical ILI case counts will be used as independent predictors to represent the lagged effect of past incidence on current ILI transmission in the community. Weekly, deidentified Kent County ILI case counts will be obtained via the MDSS database for the period from 2006 to 2019.
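
The use of past counts as predictors can be illustrated with a minimal sketch. The function name make_lagged_rows, the choice of two lags, and the case counts are all hypothetical and are not drawn from MDSS; the proposal does not fix the number of lags.

```python
from typing import List, Tuple


def make_lagged_rows(counts: List[int], n_lags: int) -> Tuple[List[List[int]], List[int]]:
    """Build (predictors, target) pairs from a weekly count series.

    Each predictor row holds the n_lags counts that precede a given week,
    most recent first; the matching target is that week's own count.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    predictors, targets = [], []
    for i in range(n_lags, len(counts)):
        predictors.append([counts[i - k] for k in range(1, n_lags + 1)])
        targets.append(counts[i])
    return predictors, targets


# Hypothetical weekly ILI case counts (illustrative values, not MDSS data).
weekly_ili = [12, 15, 21, 30, 28, 19]
X, y = make_lagged_rows(weekly_ili, n_lags=2)
print(X)  # one row of lagged counts per usable week
print(y)  # the count each row is meant to predict
```

The first n_lags weeks are dropped because they lack a full history, so a longer lag window shortens the usable series.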

Real-time and historical ILI-related emergency department (ED) visits will be included as independent variables in the model to approximate community transmission. An ILI-related ED visit is defined as a visit to an emergency room or urgent care center with a chief complaint of fever and cough and/or sore throat and no known cause other than ILI. Over 240 facilities across Michigan participate in the state-wide Michigan Syndromic Surveillance System (MSSS), which is designed for early detection of seasonal illnesses and novel and non-reportable conditions. Daily, deidentified Kent County ILI-related ED visits will be obtained via the MSSS database for the period from 2006 to 2019.
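
Because ED visits are daily while ILI case counts are weekly, the two series would need a common time unit before modeling. The sketch below shows one way to do this, assuming weeks run Sunday through Saturday; that convention, the function name weekly_totals, and the visit counts are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict


def weekly_totals(daily_visits: Dict[date, int]) -> Dict[date, int]:
    """Sum daily visit counts into weeks keyed by the Sunday starting each week."""
    totals: Dict[date, int] = defaultdict(int)
    for day, visits in daily_visits.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        days_since_sunday = (day.weekday() + 1) % 7
        week_start = day - timedelta(days=days_since_sunday)
        totals[week_start] += visits
    return dict(sorted(totals.items()))


# Hypothetical daily ILI-related ED visits (illustrative values, not MSSS data).
daily = {
    date(2019, 1, 6): 3,   # Sunday
    date(2019, 1, 7): 5,   # Monday
    date(2019, 1, 12): 2,  # Saturday
    date(2019, 1, 13): 4,  # Sunday of the following week
}
weekly = weekly_totals(daily)
print(weekly)
```

If the weekly ILI counts follow a different week definition, only the week_start calculation would need to change.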

The use of internet search trends in communicable disease forecasting is well documented in the literature (Husnayain, et al., 2019; Kandula, et al., 2017; Mavragani & Ochoa, 2018a; Mavragani & Ochoa, 2018b; Soliman, et al., 2019; Yang, et al., 2023). Here, Google search trends will be used as an indicator for early information-seeking behaviors in potentially symptomatic individuals. Weekly, normalized ILI-related search frequency data will be collected from the Health Trends Application Programming Interface (API), which was developed by Google to provide deeper access to search data for academic research. The API allows researchers to retrieve data directly from Google servers with tools coded in languages like Python. Kent County ILI search trends will be collected by combining three terms related to influenza: cough, cold, and flu. Weekly data will be collected for the period from 2006 to 2019 and will be specific to the Grand Rapids-Kentwood-Muskegon combined statistical area.

Lastly, air quality and atmospheric/meteorological factors, which may influence respiratory disease susceptibility, will be included as predictor variables in the ILI model. Weekly air quality index (AQI) values will be fetched from the EPA Air Quality Index Daily Report and divided into six grades corresponding to different levels of health concern (Good to Hazardous), as defined by the US Environmental Protection Agency (EPA). In addition, weekly main pollutants will be recorded as reported by EPA. The NOAA Climate Online Datamart will be used to access daily wind, precipitation, and temperature in Kent County from 2006 to 2020.
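
The six-grade AQI scheme can be sketched as follows, using the standard EPA category breakpoints (0-50 Good, 51-100 Moderate, 101-150 Unhealthy for Sensitive Groups, 151-200 Unhealthy, 201-300 Very Unhealthy, 301 and above Hazardous). The function names and the numeric 1-6 coding are assumptions made for illustration; the proposal does not specify how the grades will be encoded.

```python
from bisect import bisect_left

# Upper AQI bound of every category except the last (EPA breakpoints).
_UPPER_BOUNDS = [50, 100, 150, 200, 300]
_CATEGORIES = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_grade(aqi: float) -> int:
    """Return the AQI grade, from 1 (Good) to 6 (Hazardous)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left counts how many upper bounds lie strictly below the value.
    return bisect_left(_UPPER_BOUNDS, aqi) + 1


def aqi_category(aqi: float) -> str:
    """Return the EPA category name for an AQI value."""
    return _CATEGORIES[aqi_grade(aqi) - 1]


# Hypothetical AQI values chosen to sit on and around category boundaries.
for value in (42, 50, 51, 175, 301):
    print(value, aqi_grade(value), aqi_category(value))
```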

Data Management

Institutional Review Board (IRB) approval is required for research which includes subject information such as race, gender, zip code, insurance type, and other potentially identifiable personal information. This proposed ecological research uses secondary, aggregated, community-level data with no personally identifiable information. As such, the study qualifies for IRB exemption, and an exemption will be requested. The scope of the study is limited to Kent County, Michigan, and the study will produce a modestly sized dataset which may be stored on the local hard drives of facility computers.

Research Question & Objectives

This research will address the research gap in local applications of disease forecasting by answering the following question: Will a county-specific model of ILI incidence, incorporating local syndromic data, Google search trends, weather data, and air quality alongside historical ILI trends, predict ILI significantly more accurately than a model based on historical trends alone? This research will address this question with the following objectives:

1.      Develop a predictive model of ILI in Kent County, using historical ILI incidence alone.

2.      Develop a predictive model of ILI in Kent County, using historical ILI incidence and current/historical ED visits.

3.      Develop a predictive model of ILI in Kent County, using historical ILI incidence, current/historical ED visits, and heterogeneous indicator variables (Google search trends, weather, and air quality).

4.      Compare and assess model performances.

The model based on ILI time trends will act as the base control model. The Kent County Health Department currently utilizes ED visit ILI syndromic surveillance to monitor influenza in the county, so the model based on incidence time trends and ED visits will be a comparison model alongside the model based on all variables.
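
The comparison in objective 4 could be summarized with standard forecast-error metrics. The sketch below uses root mean squared error (RMSE) and mean absolute error (MAE); these metrics are an assumption, since the proposal does not name specific ones, and all observed and predicted values are hypothetical.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical held-out weekly ILI counts and forecasts from two models.
observed = [10, 20, 30, 40]
baseline_forecast = [12, 18, 33, 37]    # historical incidence only
full_model_forecast = [11, 20, 29, 41]  # all predictor variables

for name, forecast in [("baseline", baseline_forecast), ("full", full_model_forecast)]:
    print(name, round(rmse(observed, forecast), 3), round(mae(observed, forecast), 3))
```

Lower values indicate better accuracy. Deciding whether a difference between models is statistically significant would require a formal test, which this sketch does not cover.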

Analysis Plan

This analysis uses heterogeneous, multi-source data to develop an ILI forecasting model specific to Kent County, Michigan, using time series analysis methods. A time series is a series of data points at regular time intervals and ordered chronologically (Schaffer, et al., 2021). Time series data usually have three features: non-stationarity, autocorrelation, and seasonality (Schaffer, et al., 2021). This analysis will test for each of these features after first describing model variables and creating a basic model with time as a predictor variable.

Descriptives & Segmented Regression

Monthly means will be calculated for each continuous variable, and boxplots will be used to visualize variable spreads. Correlation analysis will be used to assess multicollinearity among continuous variables. Associations between month and main pollutants will be explored with a chi-square test of association and visualized with a stacked bar chart. ILI cases will be modeled with months alone, then with months and other variables.
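
As a minimal sketch of the chi-square test of association, the function below computes the Pearson chi-square statistic and degrees of freedom for a month-by-pollutant contingency table. The function name and the table values are hypothetical. Obtaining a p-value requires the chi-square distribution, which a statistics library such as SciPy (scipy.stats.chi2_contingency) can supply.

```python
from typing import List, Tuple


def chi_square_statistic(table: List[List[int]]) -> Tuple[float, int]:
    """Return the Pearson chi-square statistic and degrees of freedom.

    Rows and columns are the two categorical variables; cells are counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts of weeks by month (rows) and main pollutant (columns).
#             PM2.5  Ozone
table = [
    [20, 10],  # January
    [10, 20],  # July
]
stat, dof = chi_square_statistic(table)
print(round(stat, 3), dof)
```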


A series may be considered stationary if it maintains a constant mean, variance, and covariance over time (Schaffer, et al., 2021). In a disease time series, localized “shocks” may impact series stationarity, such as increasing community transmission during epidemic influenza seasons leading to higher incidence trends in the following months. As such, it is sometimes necessary to perform a first-differencing transformation on the series and examine the change in incidence from month to month (i.e., Y_t - Y_{t-1}) for stationarity (Schaffer, et al., 2021). The Augmented Dickey-Fuller (ADF) test is used to test the null hypothesis that a time series is nonstationary (Dickey, n.d.). ILI cases will be tested with the ADF test, and if the series fails to show stationarity, the test will be repeated on the first-differenced transformation of the series.
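
First-differencing itself is a simple transformation, sketched below with hypothetical values. The sketch does not implement the stationarity test; in practice a library routine such as statsmodels.tsa.stattools.adfuller could be applied to the original and differenced series.

```python
from typing import List, Sequence


def first_difference(series: Sequence[float]) -> List[float]:
    """Return Y_t - Y_{t-1} for each t after the first observation."""
    return [current - previous for previous, current in zip(series, series[1:])]


# Hypothetical series with a steady upward trend (not real ILI data).
trending = [5, 8, 11, 14]
print(first_difference(trending))  # the trend becomes a constant change

# Hypothetical series with irregular changes.
irregular = [10, 14, 13, 20]
print(first_difference(irregular))
```

The differenced series is one observation shorter than the original, and a linear trend in the original becomes a constant level after differencing.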


Observations in a time series are often correlated with past observations, or lags, in what is known as autocorrelation (Schaffer, et al., 2021). Strongly autocorrelated series are often nonstationary, and first-differencing is often enough to remove much of the autocorrelation. Kent County ILI cases, and their first-differenced transformation, will be examined for autocorrelation with autocorrelation function (ACF) plots, which plot the correlations between each observation and previous values at specified time units called lags.
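
The quantity plotted in an ACF can be sketched as the sample autocorrelation at each lag. The function name and the toy series (which repeats every four observations) are hypothetical and chosen only to make the pattern easy to see.

```python
from typing import Sequence


def autocorrelation(series: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(series) - 1")
    mean = sum(series) / n
    denominator = sum((value - mean) ** 2 for value in series)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


# Hypothetical series with a repeating cycle of length 4 (not real ILI data).
cyclic = [0, 1, 0, -1] * 3
for lag in range(5):
    print(lag, round(autocorrelation(cyclic, lag), 3))
```

In this toy series the autocorrelation is strongly negative at lag 2 (half a cycle) and positive again at lag 4 (a full cycle), which is the kind of pattern a seasonal series would show in an ACF plot.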

Autoregression & Seasonality

In an autoregressive (AR) model, the outcome variable is predicted with one or more of its lagged values. Seasonality refers to time series periodicity related to specific time intervals. Seasonal monthly data usually have a period of 12, whereas we might expect to see a period of 52 in seasonal weekly data.

AR methods will be used to develop a seasonal model and a nonseasonal model of weekly ILI cases in Kent County. The better performing model will move forward to be integrated with ED visits, Google search trends, and air/meteorological values as independent predictors in a complete ILI forecasting model.
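
As a simplified illustration of autoregression, the sketch below fits a model with a single lagged term, Y_t = c + phi * Y_{t-lag}, by ordinary least squares. Setting lag to 1 gives a nonseasonal AR(1) model, while setting it to 52 would give a crude seasonal analogue for weekly data. The single-lag form, the function name, and the data are assumptions; the models in this proposal may use more lags and a different estimation method.

```python
from typing import Sequence, Tuple


def fit_single_lag_ar(series: Sequence[float], lag: int = 1) -> Tuple[float, float]:
    """Fit Y_t = c + phi * Y_{t-lag} by ordinary least squares.

    Returns the intercept c and the lag coefficient phi.
    """
    if lag < 1 or len(series) - lag < 2:
        raise ValueError("need at least two (lagged, current) pairs")
    x = list(series[:-lag])  # lagged values
    y = list(series[lag:])   # current values
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("lagged values are constant; the slope is undefined")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    phi = sxy / sxx
    c = y_mean - phi * x_mean
    return c, phi


# Hypothetical series generated exactly by Y_t = 2 + 0.5 * Y_{t-1}.
series = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_single_lag_ar(series, lag=1)
next_value = c + phi * series[-1]  # one-step-ahead forecast
print(round(c, 3), round(phi, 3), round(next_value, 4))
```

Because the toy series follows the assumed relationship exactly, the fit recovers its intercept and coefficient; real ILI data would leave residual error that the comparison of seasonal and nonseasonal models would need to weigh.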

Expected Results

This research is expected to produce a viable ILI forecasting model for Kent County, Michigan. If the model outperforms current forecasting methods, the results will have significant implications for the applicability of advanced analytics and forecasting in local public health. Improving forecasting capacity in local public health is included in the CDC Center for Forecasting and Outbreak Analytics 2023-2028 Strategic Plan. The methodology and lessons learned in this research may be used to inform further local public health capacity-building, irrespective of findings.