1. Project Outline

The data set selected for analysis for this project is titled “WFIGS - Wildland Fire Locations Full History”, created by the National Interagency Fire Center (NIFC). It contains data for all reported wildland fires in the United States. Data prior to the adoption of IRWIN (Integrated Reporting of Wildland Fire Information) in 2014 is included in this data set, but is incomplete and being incorporated as an ongoing project.

The data set can be found at the link below:

data-nifc.opendata.arcgis.com/datasets/nifc::wfigs-wildland-fire-locations-full-history/about

This data set was chosen because wildfire management is a subject of personal and professional interest to me: I have four seasons of wildland firefighting experience.

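This analysis aims to explore trends among wildfire incidents in the western United States and discover how their parameters are connected. There is not just one motivating question but several: how fire cause, land ownership, state of origin, and fuel model are distributed; how fire size and suppression cost vary by state; and whether fire size and suppression cost are correlated.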

1.1 - Data Cleaning and Parsing

The first step taken in this analysis was to clean and parse the 215,997-row, 90-column .csv file containing the wildfire data. Although this data set contained an immense amount of data, many of the entries were incomplete, empty, or irrelevant for the purposes of this analysis. 

The data was then exported as a .csv file, imported into RStudio, and loaded into a data frame named “Wildfire”.
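The loading step is not shown in this report. A minimal sketch of how it might look is given below, assuming base R's read.csv(); the file path is hypothetical, so a tiny temporary CSV with made-up rows stands in for the real download.

```r
# Stand-in for the downloaded file: write a tiny CSV to a temporary path.
# (With the real data, csv_path would point at the downloaded .csv file;
# that path is not given in this report.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "CalculatedAcres,POOState,FireCause",
  "12.5,US-CA,Human",
  ",US-OR,",
  "300,US-TX,Natural"
), csv_path)

# Read the file; stringsAsFactors = FALSE keeps text columns as character.
Wildfire <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(Wildfire)   # number of rows and columns
str(Wildfire)   # column types at a glance
```

Note that read.csv() turns blank numeric fields into NA but leaves blank text fields as empty strings, which is why a separate clean-up step is needed for text columns.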

The parameters of interest were then loaded into a separate data frame called “fire”. The parameters selected for analysis include:

  • Calculated Acres (Fire Size)

  • Fire Discovery Date & Time

  • POO State (POO = Point Of Origin)

  • POO Landowner

  • Primary Fuel Model

  • Secondary Fuel Model

  • Total Incident Personnel

  • Estimated Cost To Date (Cost of fire suppression)

  • Fire Cause

  • Discovery Acres

fire<-data.frame(Wildfire$CalculatedAcres, Wildfire$EstimatedCostToDate, Wildfire$FireBehaviorGeneral, Wildfire$FireCause,Wildfire$FireCauseGeneral, Wildfire$FireDiscoveryDateTime, Wildfire$POOLandownerCategory, Wildfire$POOLandownerKind, Wildfire$POOState, Wildfire$PrimaryFuelModel, Wildfire$SecondaryFuelModel, Wildfire$TotalIncidentPersonnel, Wildfire$DiscoveryAcres)
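Building the data frame from Wildfire$... expressions is why every column in the rest of this report carries a "Wildfire." prefix. The sketch below illustrates this on a toy data frame with made-up values, and shows subsetting by name as one possible alternative that keeps the original column names.

```r
# Toy stand-in for the full "Wildfire" data frame (values are made up).
Wildfire <- data.frame(
  CalculatedAcres = c(10, 250),
  POOState        = c("US-CA", "US-OR"),
  IncidentName    = c("A", "B")
)

# Building a data frame from Wildfire$... expressions prefixes each name.
fire <- data.frame(Wildfire$CalculatedAcres, Wildfire$POOState)
names(fire)

# Subsetting by name keeps the original column names instead.
fire_alt <- Wildfire[, c("CalculatedAcres", "POOState")]
names(fire_alt)
```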

Next, the data frame was further filtered using dplyr such that it only contained wildfires with a point of origin within the western United States, including Alaska. This data frame was named “westFires”.

library(dplyr)
westFires <- fire %>% filter(grepl('US-WA|US-OR|US-ID|US-MT|US-CA|US-NV|US-CO|US-AK|US-WY|US-UT|US-AZ|US-NM', Wildfire.POOState))

This left the data frame with a total of 156,511 entries.
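The pattern match above works because each state code is distinctive. As an alternative sketch (not the method used in this report), the same filter can be written with %in%, which compares whole values rather than substrings. The toy data frame and the west_states vector below are illustrative names only.

```r
library(dplyr)

# Toy stand-in for the "fire" data frame (values are made up).
fire <- data.frame(
  Wildfire.POOState        = c("US-CA", "US-TX", "US-AK", "US-FL", NA),
  Wildfire.CalculatedAcres = c(10, 20, 30, 40, 50)
)

# The twelve state codes used in this report.
west_states <- c("US-WA", "US-OR", "US-ID", "US-MT", "US-CA", "US-NV",
                 "US-CO", "US-AK", "US-WY", "US-UT", "US-AZ", "US-NM")

# %in% matches whole values only, and treats NA as "not in the set".
westFires <- fire %>% filter(Wildfire.POOState %in% west_states)

nrow(westFires)
```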

Many of the data entries had empty strings or null parameters. These empty parameters were changed to NA using the code below.

westFires[westFires==""]<-NA
westFires[westFires=="#"]<-NA
westFires[westFires=="null"]<-NA
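A quick way to check the effect of this replacement is to count missing values per column afterwards. The sketch below does this on a toy data frame with made-up values. It also shows that a value containing literal quote characters is a different string and is not caught by the comparison with "null", which may explain the single "null" entry that still appears in the landowner table in section 2.3.

```r
# Toy stand-in for "westFires" (values are made up).
westFires <- data.frame(
  Wildfire.FireCause       = c("Human", "", "null", "\"null\"", "#"),
  Wildfire.CalculatedAcres = c(1.5, 4, 20, 3, 7),
  stringsAsFactors = FALSE
)

# Same replacement pattern as above: a logical mask selects the cells to blank.
westFires[westFires == ""]     <- NA
westFires[westFires == "#"]    <- NA
westFires[westFires == "null"] <- NA

# Count missing values per column to confirm the replacement.
colSums(is.na(westFires))

# The fourth value still holds the quoted string and was not replaced.
westFires$Wildfire.FireCause
```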

2. Distribution of Discrete Attributes

The first topic of analysis was the distribution of discrete attributes within the data set. These include fire cause, landowner category, state of origin, and primary fuel model.


2.1 - Wildfire Causes

table(westFires$Wildfire.FireCause, useNA = "ifany")
## 
##        Human      Natural Undetermined      Unknown         <NA> 
##        50323        39656        24581        26803        15148
barplot(table(westFires$Wildfire.FireCause))


From this table, we can see that human causes form the largest category (50,323 of 156,511 fires, about 32%), although they do not account for a majority of the wildfires in this data set.
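Raw counts can be easier to compare as shares. The sketch below recomputes proportions from the counts printed in the table above; the vector name and the "Missing" label for the NA column are illustrative choices.

```r
# Counts copied from the table above ("Missing" stands for the <NA> column).
cause_counts <- c(Human = 50323, Natural = 39656, Undetermined = 24581,
                  Unknown = 26803, Missing = 15148)

# Share of all western fires in each category, as percentages.
round(100 * prop.table(cause_counts), 1)

# Share among fires with a known (Human or Natural) cause only.
known <- cause_counts[c("Human", "Natural")]
round(100 * known / sum(known), 1)
```

With the real data frame, the same shares could be obtained by wrapping the table() call in prop.table().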
 

2.2 - Specific Wildfire Causes

table(westFires$Wildfire.FireCauseGeneral, useNA = "ifany")
## 
##                           Arson                         Camping 
##                               1                            3648 
## Cause and Origin Not Identified                       Coal Seam 
##                             858                              82 
##         Debris and open burning       Debris Burning (Fire Use) 
##                              35                             201 
##             Debris/Open Burning                       Equipment 
##                            2936                            2776 
##       Equipment and vehicle use                Firearms/Weapons 
##                               9                             572 
##                      Incendiary   Investigated but Undetermined 
##                            1029                             591 
##                       Lightning                      Misc/Other 
##                            9207                             215 
##                         Natural               Other Human Cause 
##                             305                            5800 
##             Other Natural Cause                        Railroad 
##                             461                             217 
##         Recreation and ceremony                         Smoking 
##                               2                             234 
##                    Undetermined Undetermined (remarks required) 
##                            1494                              15 
##                       Utilities                            <NA> 
##                             832                          124991

The most common specific wildfire cause in this data set is lightning (9,207). However, most of the wildfires in the data set (124,991 of 156,511) do not have a specific cause listed.
 

2.3 - Point of Origin Landowner Category

table(westFires$Wildfire.POOLandownerCategory, useNA = "ifany")
## 
##  "null"   ANCSA     BIA     BLM     BOR    City  County     DOD     DOE Foreign 
##       1     305   14074   15536     133     249     324     407      17       2 
##     NPS  OthLoc Private   State  Tribal    USFS   USFWS    <NA> 
##    2705      63   32400    2703    2837   37130     980   46645

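Among wildfires with a recorded landowner category, most occurred on United States Forest Service (USFS) land (37,130) and private land (32,400), followed by Bureau of Land Management (BLM) and Bureau of Indian Affairs (BIA) land. A further 46,645 records have no landowner category listed.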
 

2.4 - Point of Origin State

table(westFires$Wildfire.POOState, useNA = "ifany")
## 
## US-AK US-AZ US-CA US-CO US-ID US-MT US-NM US-NV US-OR US-UT US-WA US-WY 
##  4017 15869 48117  8351  9492 14977  7960  4950 14242  9703 11008  7825
barplot(table(westFires$Wildfire.POOState),space = 0.5, width = 1)


The states which have the highest number of wildfires in this data set are California, Arizona, Montana, Oregon, and Washington.
 

2.5 - Primary Fuel Model

table(westFires$Wildfire.PrimaryFuelModel, useNA = "ifany")
## 
##                 Brush (2 feet)             Chaparral (6 feet) 
##                           1299                            517 
##           Closed Timber Litter  Dormant Brush, Hardwood Slash 
##                            110                             55 
##                Hardwood Litter            Heavy Logging Slash 
##                             99                             33 
##            Light Logging Slash           Medium Logging Slash 
##                             13                             28 
##           Short Grass (1 foot)                 Southern Rough 
##                           1278                              1 
##          Tall Grass (2.5 feet)  Timber (Grass and Understory) 
##                            664                           1213 
## Timber (Litter and Understory)                           <NA> 
##                           1516                         149685

Most wildfires in this data set have no primary fuel model listed (149,685). Among those that do, timber is the most common primary fuel type, followed by grass and brush.
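The ranking of broad fuel types can be checked by grouping the individual fuel models. The sketch below does this from the counts printed in the table above, using the same keywords that section 4 uses to split the data; the "Other" group and the variable names are illustrative.

```r
# Counts copied from the table above (records with no fuel model omitted).
fuel_counts <- c(
  "Brush (2 feet)" = 1299, "Chaparral (6 feet)" = 517,
  "Closed Timber Litter" = 110, "Dormant Brush, Hardwood Slash" = 55,
  "Hardwood Litter" = 99, "Heavy Logging Slash" = 33,
  "Light Logging Slash" = 13, "Medium Logging Slash" = 28,
  "Short Grass (1 foot)" = 1278, "Southern Rough" = 1,
  "Tall Grass (2.5 feet)" = 664, "Timber (Grass and Understory)" = 1213,
  "Timber (Litter and Understory)" = 1516
)

# Assign each model to a broad group by keyword. "Timber" is tested first
# because "Timber (Grass and Understory)" also contains the word "Grass".
model <- names(fuel_counts)
group <- ifelse(grepl("Timber", model), "Timber",
         ifelse(grepl("Short Grass|Tall Grass", model), "Grass",
         ifelse(grepl("Brush", model), "Brush",
         ifelse(grepl("Chaparral", model), "Chaparral", "Other"))))

# Total fires per group, largest first.
group_totals <- tapply(fuel_counts, group, sum)
group_totals[order(group_totals, decreasing = TRUE)]
```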

3. Distribution of Quantitative Attributes

Next, the distribution of quantitative attributes within the data set was analyzed. These include Calculated Acres, Estimated Cost to Date, Total Incident Personnel, and Discovery Acres.

3.1 - Calculated Acres

summary(westFires$Wildfire.CalculatedAcres)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      3.3    113.4   7396.2   1881.0 963405.4   152744

The mean fire size for this data set is about 7,396 acres, far above the median of 113.4 acres. The mean is likely highly influenced by the presence of megafires (fires of 100,000 acres and above) in the data set.
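The gap between mean and median is typical of strongly right-skewed data. The sketch below illustrates the effect on a handful of made-up fire sizes, and shows a log10 summary as one way to describe such data.

```r
# Made-up fire sizes in acres: mostly small fires plus one very large one.
acres <- c(0.5, 2, 3.3, 40, 113, 900, 1881, 250000)

mean(acres)     # pulled upward by the single large fire
median(acres)   # much closer to a "typical" fire

# Summarising log10(acres) puts small and large fires on a comparable scale.
# Zero-acre records are dropped first because log10(0) is -Inf.
summary(log10(acres[acres > 0]))
```

The same caveat about zero values applies to any plot drawn on a log10 axis, since the minimum calculated acreage in this data set is 0.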

boxplot(westFires$Wildfire.CalculatedAcres)


Due to the large number of outliers, the box plot above does not tell us much about the distribution of fire size. Separating the box plot by state and changing the y-axis to a log10 scale should make it more readable.

library(ggplot2)
 firesize <- ggplot(westFires %>% filter(!is.na(Wildfire.CalculatedAcres)), aes(x=Wildfire.POOState, y=Wildfire.CalculatedAcres, fill=Wildfire.POOState)) + 
  geom_boxplot()+ scale_y_log10(
   breaks = scales::trans_breaks("log10", function(x) 10^x),
   labels = scales::trans_format("log10", scales::math_format(10^.x)))+
  labs(title="Fire Size versus State of Origin",x="State", y = "Fire Size")
 firesize + theme_classic()+ theme(panel.grid.minor.y = element_line(color = 1, size = 0.05, linetype = 1)) + theme(panel.grid.major.y = element_line(color = 1, size = 0.1, linetype = 1))


From this chart, we can see that Alaska and Nevada have comparatively high median fire sizes (~1,000-10,000 acres), while California, Montana, Washington, and Wyoming have comparatively low median fire sizes (~10-100 acres).

3.2 - Estimated Cost to Date

summary(westFires$Wildfire.EstimatedCostToDate)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         0      8000    120000   2971414    800000 637428216    149466

The mean cost of fire suppression per fire in this data set is 2,971,414 USD. The most expensive fire in this data set was the 2021 August Complex, at a cost of 637,428,216 USD.

The box plot below shows fire suppression cost separated by state, using the same technique as used above for fire size versus state of origin.

library(ggplot2)
 firecost <- ggplot(westFires %>% filter(!is.na(Wildfire.EstimatedCostToDate)), aes(x=Wildfire.POOState, y=Wildfire.EstimatedCostToDate, fill=Wildfire.POOState)) + 
  geom_boxplot()+ scale_y_log10(
   breaks = scales::trans_breaks("log10", function(x) 10^x),
   labels = scales::trans_format("log10", scales::math_format(10^.x))
 ) +
  labs(title="Fire Cost versus State of Origin",x="State", y = "Fire Cost")
 firecost + theme_classic()+ theme(panel.grid.minor.y = element_line(color = 1, size = 0.05, linetype = 1)) + theme(panel.grid.major.y = element_line(color = 1, size = 0.1, linetype = 1))


From the graph, we can see that Alaska has the lowest median fire suppression costs at slightly above 10,000 USD, while California has the highest median suppression cost at slightly above 1,000,000 USD.

4. Regression Analysis of Wildfire Parameters

This next section explores whether any correlation exists among calculated acres, estimated cost, discovery acres, and total incident personnel.

The first step taken in this regression analysis was to create a scatterplot matrix to see if there was any possible correlation visibly present between these parameters.

pairs(westFires[,c(1,2,12,13)])


No correlation between the quantitative parameters is readily apparent from this chart. However, there appears to be a possible relation between fire size and total cost.
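A scatterplot matrix can be complemented with a numeric correlation matrix. The sketch below uses made-up values and short illustrative column names. Because most rows in this data set are missing at least one quantitative value, it assumes pairwise-complete observations are acceptable, so each pair of columns uses every row where both values are present.

```r
# Toy stand-in for the four quantitative columns (values are made up).
quant <- data.frame(
  acres     = c(10, 200, NA, 5000, 120, 60000),
  cost      = c(5e3, 8e4, 2e4, NA, 4e4, 9e6),
  personnel = c(3, 25, 10, 150, NA, 900),
  discovery = c(0.1, 1, 0.5, 10, 2, NA)
)

# Pearson correlations computed pair by pair, skipping rows with NA
# in either of the two columns being compared.
cor_mat <- cor(quant, use = "pairwise.complete.obs")
round(cor_mat, 2)
```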

4.1 - Wildfire Cost Versus Calculated Acres

ggplot(westFires, aes(x = Wildfire.CalculatedAcres, y = Wildfire.EstimatedCostToDate)) + geom_point(size = 0.3)


The two quantitative parameters that most closely show some sort of correlation are wildfire suppression cost and calculated acres, but any correlation is likely weak.

summary(lm(westFires$Wildfire.CalculatedAcres ~ westFires$Wildfire.EstimatedCostToDate))
## 
## Call:
## lm(formula = westFires$Wildfire.CalculatedAcres ~ westFires$Wildfire.EstimatedCostToDate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -472788   -6903   -6002   -2918  484995 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            6.216e+03  7.876e+02   7.892 4.89e-15
## westFires$Wildfire.EstimatedCostToDate 1.156e-03  3.168e-05  36.491  < 2e-16
##                                           
## (Intercept)                            ***
## westFires$Wildfire.EstimatedCostToDate ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34090 on 1981 degrees of freedom
##   (154528 observations deleted due to missingness)
## Multiple R-squared:  0.402,  Adjusted R-squared:  0.4017 
## F-statistic:  1332 on 1 and 1981 DF,  p-value: < 2.2e-16

With an adjusted R-Squared value of 0.4017, the linear model explains about 40% of the variance, so the correlation between fire size and fire cost is fairly weak. Although fire size does affect fire cost, other factors such as fuel type and location also play large roles. For example, wildfires in lighter fuels such as grass and brush spread much more quickly, so they often burn more acres than fires in heavier, less readily ignitable fuels such as timber. However, fire suppression costs in timber are usually higher per acre, as suppression is much more time and resource intensive than in lighter fuels. If we separate the data by primary fuel model, will the correlation between fire cost and fire size be more apparent?

library(dplyr)
Grass <- westFires  %>% filter(grepl('Short Grass|Tall Grass', Wildfire.PrimaryFuelModel))
Brush <- westFires  %>% filter(grepl('Brush', Wildfire.PrimaryFuelModel))
Timber <- westFires  %>% filter(grepl('Timber', Wildfire.PrimaryFuelModel))
Chaparral <- westFires  %>% filter(grepl('Chaparral', Wildfire.PrimaryFuelModel))

Note: Chaparral is a highly flammable scrubland plant community composed of broad-leaved evergreen shrubs, bushes, and small trees usually less than 2.5 metres (about 8 feet) tall. It is commonly found in much of California.
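A note on reading the R-Squared values in this section: the models are fitted with calculated acres as the response and cost as the predictor, while the discussion is phrased as cost versus size. For a simple linear regression with one predictor this does not change R-Squared, which equals the squared Pearson correlation either way. The sketch below checks this on simulated (made-up) data.

```r
set.seed(1)

# Simulated size and cost values (made up, for illustration only).
acres <- rexp(200, rate = 1 / 5000)
cost  <- 800 * acres + rnorm(200, sd = 4e6)

# R-squared with acres as the response, as in the models in this section...
r2_acres_on_cost <- summary(lm(acres ~ cost))$r.squared
# ...and with cost as the response instead.
r2_cost_on_acres <- summary(lm(cost ~ acres))$r.squared

# Both equal the squared Pearson correlation coefficient.
r <- cor(acres, cost)
c(r2_acres_on_cost, r2_cost_on_acres, r^2)

# Converting a reported R-squared to |r|, e.g. the all-fuels value of 0.402.
sqrt(0.402)
```

The slope and intercept do change when the two variables are swapped, so the coefficient estimates in the outputs describe acres per dollar rather than dollars per acre.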

4.1.1 - Grass Fire Acreage vs. Cost

plot(Grass$Wildfire.CalculatedAcres, Grass$Wildfire.EstimatedCostToDate)

summary(lm(Grass$Wildfire.CalculatedAcres ~ Grass$Wildfire.EstimatedCostToDate))
## 
## Call:
## lm(formula = Grass$Wildfire.CalculatedAcres ~ Grass$Wildfire.EstimatedCostToDate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -196772   -9262   -8031   -2070  286882 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.550e+03  1.456e+03   6.561 1.57e-10 ***
## Grass$Wildfire.EstimatedCostToDate 1.008e-03  1.271e-04   7.936 1.91e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29130 on 421 degrees of freedom
##   (1519 observations deleted due to missingness)
## Multiple R-squared:  0.1301, Adjusted R-squared:  0.1281 
## F-statistic: 62.98 on 1 and 421 DF,  p-value: 1.908e-14

With an adjusted R-Squared value of just 0.1281, there is only a very weak linear correlation between fire size and fire cost for grass fires, even though the slope is statistically significant.

4.1.2 - Brush Fire Acreage vs. Cost

plot(Brush$Wildfire.CalculatedAcres, Brush$Wildfire.EstimatedCostToDate)

summary(lm(Brush$Wildfire.CalculatedAcres ~ Brush$Wildfire.EstimatedCostToDate))
## 
## Call:
## lm(formula = Brush$Wildfire.CalculatedAcres ~ Brush$Wildfire.EstimatedCostToDate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47401  -7800  -6356  -2660 351602 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        6.569e+03  2.995e+03   2.193 0.029471 *  
## Brush$Wildfire.EstimatedCostToDate 2.230e-03  5.772e-04   3.864 0.000152 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37500 on 196 degrees of freedom
##   (1156 observations deleted due to missingness)
## Multiple R-squared:  0.07077,    Adjusted R-squared:  0.06603 
## F-statistic: 14.93 on 1 and 196 DF,  p-value: 0.0001518

With an adjusted R-Squared value of just 0.06603, there is only a very weak linear correlation between fire size and fire cost for brush fires, even though the slope is statistically significant.

4.1.3 - Timber Fire Acreage vs. Cost

plot(Timber$Wildfire.CalculatedAcres, Timber$Wildfire.EstimatedCostToDate)

summary(lm(Timber$Wildfire.CalculatedAcres ~ Timber$Wildfire.EstimatedCostToDate))
## 
## Call:
## lm(formula = Timber$Wildfire.CalculatedAcres ~ Timber$Wildfire.EstimatedCostToDate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -205610   -6337   -4045   -1967  485636 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         3.978e+03  1.087e+03   3.659 0.000266 ***
## Timber$Wildfire.EstimatedCostToDate 1.525e-03  4.041e-05  37.730  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33550 on 1018 degrees of freedom
##   (1819 observations deleted due to missingness)
## Multiple R-squared:  0.5831, Adjusted R-squared:  0.5826 
## F-statistic:  1424 on 1 and 1018 DF,  p-value: < 2.2e-16

With an R-Squared value of 0.5826, there is a weak/medium correlation between fire cost and fire size in timber fires. This value is notably higher than the previous R-Squared value for all fuel types (0.4017).

4.1.4 - Chaparral Fire Acreage vs. Cost

plot(Chaparral$Wildfire.CalculatedAcres, Chaparral$Wildfire.EstimatedCostToDate)

summary(lm(Chaparral$Wildfire.CalculatedAcres ~ Chaparral$Wildfire.EstimatedCostToDate))
## 
## Call:
## lm(formula = Chaparral$Wildfire.CalculatedAcres ~ Chaparral$Wildfire.EstimatedCostToDate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70251  -5017  -3231  -1236 260597 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            3.130e+03  4.370e+03   0.716    0.476
## Chaparral$Wildfire.EstimatedCostToDate 1.066e-03  1.053e-04  10.123  2.4e-15
##                                           
## (Intercept)                               
## Chaparral$Wildfire.EstimatedCostToDate ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34530 on 70 degrees of freedom
##   (445 observations deleted due to missingness)
## Multiple R-squared:  0.5941, Adjusted R-squared:  0.5883 
## F-statistic: 102.5 on 1 and 70 DF,  p-value: 2.399e-15

With an R-Squared value of 0.5883, there is a weak/medium correlation between fire cost and fire size in Chaparral fires. This R-Squared value is comparable to that of timber fires, as well as notably higher than the R-Squared value of all fuel types (0.4017).

4.1.5 - Summary

From this analysis, we can conclude that fire size has a notable linear correlation with fire cost for heavier fuel types such as chaparral and timber, but only a very weak one for lighter fuel types such as grass and brush.
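The four per-fuel models above were fitted one at a time. As a sketch of a more compact approach (not the one used in this report), the same regression can be fitted within each group in a single pass using split() and sapply(). The toy data frame, its column names, and its values are made up.

```r
set.seed(42)

# Toy stand-in for "westFires" with two fuel groups (values are made up).
toy <- data.frame(
  fuel  = rep(c("Grass", "Timber"), each = 50),
  acres = rexp(100, rate = 1 / 2000)
)
toy$cost <- ifelse(toy$fuel == "Timber", 900, 100) * toy$acres +
  rnorm(100, sd = 5e5)

# Fit the same simple regression within each fuel group and collect
# the adjusted R-squared values in one named vector.
adj_r2 <- sapply(split(toy, toy$fuel), function(d) {
  summary(lm(acres ~ cost, data = d))$adj.r.squared
})
round(adj_r2, 3)
```

Collecting the values in one vector makes it easier to compare groups side by side than reading four separate summary() outputs.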

5. Reflection/Summary

5.1 - Visualization Design and Selection

For my simpler visualizations, standard R functions (plot(), table(), barplot(), pairs(), etc.) were adequate. The more complex visualizations, such as the box plots used to compare fire size and fire cost versus state of origin, required ggplot2 functions in order to adequately visualize the data.

5.2 - Challenges Faced During Analysis

The largest challenges I faced during the completion of this project stemmed from the large size (90 columns and almost 216,000 entries) and the raw, unprocessed nature of the data set I selected. Through trial and error, I was eventually able to find the R libraries and functions I needed to clean and parse the data into data frames that were adequate for analysis.

Although analysis of the distribution of the discrete parameters of this data set was relatively straightforward, the analysis and visualization of the distribution of and correlations between quantitative values proved more difficult. By dividing the data further into sub-categories such as state of origin and primary fuel model, the visualizations and statistical analysis became much easier to digest and draw conclusions from.

5.3 - Notable Results from Analysis

The result of this analysis which I found most surprising was the large difference in median fire size and fire cost between states, particularly the difference in median fire cost between Alaska (~10,000 USD) and California (~1,000,000 USD).

The second surprising result from this analysis is the enormous difference in linear correlation between fire size and fire cost when the data is separated by primary fuel type. While brush and grass fires had almost no correlation between fire size and fire cost (adjusted R-Squared = 0.07 for brush and 0.13 for grass), timber and chaparral fires (adjusted R-Squared = 0.58 and 0.59) had a moderate correlation.

5.4 - Possible Areas of Further Analysis

I feel I have only just scratched the surface of analyzing this data set. Some future analysis I would like to do with this data includes exploring more combinations of discrete parameters versus quantitative parameters, such as fire cost and fire size versus primary fuel model and landowner kind. I would also like to explore further how other discrete parameters such as state and landowner kind can result in stronger or weaker correlations between fire cost and fire size. Multiple linear regression models for fire size and fire cost would also be interesting to explore.