Can we use precipitation data to predict the orthophosphate levels in bodies of water?
You are probably wondering:
Orthophosphate is known as reactive phosphate and is a main ingredient in fertilizer. Basically, this stuff makes living stuff grow; it’s a nutrient that is “readily utilized by biota”. When it rains, orthophosphates can get into bodies of water through run-off (Soltis-Muth, n.d.).
In biology terms, an excess of nutrients can lead to eutrophication. You should care because this phenomenon is something bad for the environment and humans.
Wth great nutrients comes a whole lot of algae. (And responsibility as I’ll explain!)
Wow, another big word in the title! Here is a brief simplified overview of what eutrophication is.
There are many other problems such as an increase in pH that is associated with algae blooms (Chislock et al. 2013).
Let’s backtrack. Nutrients in run-off that came from fertilizers played a part in this catastrophe. So be careful, mindful, and responsible when you use fertilizers!
Still not convinced to care? I conceed that I have not discussed how this affects humans.
My motivation for investigating The Question is that: if we can predict orthophosphate levels using weather, perhaps we can limit the damaging effects of these blooms.
Getting data is no easy task for data scientists.
Often times if one has data, one has to clean or scrape that data. Other times one does not have data and has to spend hours finding suitable data. (Sometimes it is both.)
For this task, I spent many hours (possibly 6 or more) before I found data that was both good quality and was related to a good question!
At first I looked across different dataset repositories such as https://www.data.gov. I did not have any luck until I came across AWS open data: https://registry.opendata.aws.
Some candidate datasets I considered using were:
Although AWS open data is a great place for data, the amount became overwhelming; I spent much of my time looking there. Unfortunately, I did not find any data of my liking.
These were just some of the datasets I found not suitable.
I decided to search a bit more local. Perhaps the county or state government would have some good data. Indeed I found good data on Maryland’s open data: https://opendata.maryland.gov.
I ended up settling with the Water Point Source Sample data:
https://opendata.maryland.gov/Energy-and-Environment/Water-Point-Source-Sample-Results/eqs6-savc
Now all I needed was precipitation data. Since I now knew exactly what I was looking for, I found it pretty quickly. I will also be using data from NOAA’s Climate Data Online:
https://www.ncdc.noaa.gov/cdo-web/
https://www.ncdc.noaa.gov/cdo-web/datasets
The documentation can be found here:
https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/PRECIP_HLY_documentation.pdf
I will have included both datasets in my Github repository. If you wish to go through the process yourself, here are the details.
This is a free dataset but we still have to “order” it. The steps to obtain this dataset are as follows:
Your parameters should look something like the following.
Now that we have found the data, it is a good idea to load them. It will help in describing the data when the metadata is not clear.
First, I will define what packages you’ll need. It is helpful to know this at the start.
# tibble related libraries
library(tidyverse)
library(tidyr)
# date processing
library(lubridate)
# string manipulation
library(stringr)
# map capabilities
library(leaflet)
Water Point Source Sample
# do not worry about this, explanation is in the data description
m_types <- cols(
`Org Name` = col_character(),
`Station ID` = col_character(),
County = col_character(),
HUC = col_double(),
`Station Horizontal Datum` = col_character(),
`Activity ID` = col_character(),
`Activity Start` = col_character(),
`Activity Medium` = col_character(),
`Activity Type` = col_character(),
`Activity Category-Rep Num` = col_character(),
`Characteristic Name` = col_character(),
`Sample Fraction` = col_character(),
`Value Type` = col_character(),
`Result Value Status` = col_character(),
`Result Value` = col_double(),
Units = col_character(),
`Analytical Proc ID` = col_character(),
`Location 1` = col_character(),
Counties = col_double()
)
water_df <- read_csv("Water_Point_Source_Sample_Results.csv",
col_types = m_types) %>%
# parse the strings into date objects
# format is month day year hours:minutes:seconds
mutate(the_date = mdy_hms(`Activity Start`))
NOAA Precipitation
# do not worry about this, explanation is in the data description
m_types <- cols(
STATION = col_character(),
STATION_NAME = col_character(),
ELEVATION = col_double(),
LATITUDE = col_double(),
LONGITUDE = col_double(),
# here the data is in this format: year-month-day hours:minutes:seconds
DATE = col_datetime(format = "%Y%m%d %H:%M"),
HPCP = col_double(),
`Measurement Flag` = col_character(),
`Quality Flag` = col_character()
)
rain_df <- read_csv("precipitation.csv", col_types = m_types)
I am using two datasets. The primary data (with our “target” orthophosphate attribute) is Water Point Source Sample data and the auxillary data is NOAA Precipitation data.
This biggest issue with this dataset is that the metadata only includes the suggested data type (that may or may not be correct). The only two types listed are “text” and “number”.
Without any description, I have to either infer or do analysis to gain more context. We will see below that there is much pre-analysis even before exploratory data analysis.
water_df %>% head()
## # A tibble: 6 x 20
## `Org Name` `Station ID` County HUC `Station Horizo… `Activity ID`
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Maryland … MD0021598 ALLEG… 2.07e6 NAD83 FY12-PS903
## 2 Maryland … MD0021598 ALLEG… 2.07e6 NAD83 FY12-PS903
## 3 Maryland … MD0021598 ALLEG… 2.07e6 NAD83 FY12-PS903
## 4 Maryland … MD0021598 ALLEG… 2.07e6 NAD83 FY12-PS903
## 5 Maryland … MD0021598 ALLEG… 2.07e6 NAD83 FY12-PS902
## 6 Maryland … MD0021598 ALLEG… 2.07e6 NAD83 FY12-PS902
## # … with 14 more variables: `Activity Start` <chr>, `Activity
## # Medium` <chr>, `Activity Type` <chr>, `Activity Category-Rep
## # Num` <chr>, `Characteristic Name` <chr>, `Sample Fraction` <chr>,
## # `Value Type` <chr>, `Result Value Status` <chr>, `Result Value` <dbl>,
## # Units <chr>, `Analytical Proc ID` <chr>, `Location 1` <chr>,
## # Counties <dbl>, the_date <dttm>
As you can see below, there are 163,574 samples.
water_df %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 163574
Below you can see that all samples are taken on the first day of every month for 5 years (There are 60 distinct dates and 60 / 12 = 5).
water_df %>%
group_by(the_date) %>%
summarize(samples = n())
## # A tibble: 60 x 2
## the_date samples
## <dttm> <int>
## 1 2008-07-01 00:00:00 2701
## 2 2008-08-01 00:00:00 2667
## 3 2008-09-01 00:00:00 2707
## 4 2008-10-01 00:00:00 2678
## 5 2008-11-01 00:00:00 2686
## 6 2008-12-01 00:00:00 2720
## 7 2009-01-01 00:00:00 2742
## 8 2009-02-01 00:00:00 2746
## 9 2009-03-01 00:00:00 2778
## 10 2009-04-01 00:00:00 2756
## # … with 50 more rows
Data Type: Text
The organization name. This is not particularly interesting since all the samples are under “Maryland Point data”. I will definitely be dropping this column.
(This attribute name is somewhat of a misnomer since the value isn’t really an organization.)
water_df %>%
group_by(`Org Name`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Org Name` samples
## <chr> <int>
## 1 Maryland Point data 163574
Data Type: Nominal Categorical
The unique identifier of the station. There are a total of 272 stations.
water_df %>%
group_by(`Station ID`) %>%
summarize(samples = n()) %>%
arrange(desc(samples))
## # A tibble: 272 x 2
## `Station ID` samples
## <chr> <int>
## 1 MD0000311 780
## 2 MD0021687 780
## 3 MD0001775 768
## 4 MD0001422 756
## 5 MD0003221 660
## 6 MD0020001 660
## 7 MD0020010 660
## 8 MD0020052 660
## 9 MD0020095 660
## 10 MD0020168 660
## # … with 262 more rows
Data Type: Nominal Categorical
The county in Maryland that the station is located in.
You can see below the number of samples in the data for each county. There are some listed as unspecified that might need to be filtered out.
water_df %>%
group_by(`County`) %>%
summarize(samples = n()) %>%
arrange(desc(samples))
## # A tibble: 24 x 2
## County samples
## <chr> <int>
## 1 FREDERICK 19994
## 2 CECIL 15488
## 3 ANNE ARUNDEL 13888
## 4 WASHINGTON 12820
## 5 CARROLL 9362
## 6 CHARLES 9209
## 7 GARRETT 9100
## 8 ALLEGANY 8362
## 9 BALTIMORE 7130
## 10 PRINCE GEORGE'S 6080
## # … with 14 more rows
Data Type: Nominal Categorical
(see “Hydrologic Unit Maps” 2018)
Data Type: Nominal Categorical
(see “The Elements of Geodesy: The Horizontal Datum” 2004)
As you can see below, there are no stations that explicitly use NAD27 but there are some unknown entries that might need to be filtered out.
water_df %>%
group_by(`Station Horizontal Datum`) %>%
summarize(samples = n())
## # A tibble: 2 x 2
## `Station Horizontal Datum` samples
## <chr> <int>
## 1 NAD83 163496
## 2 UNKWN 78
Data Type: Nominal Categorical
This attribute was not documented well. So I will first provide some preliminary analysis for determining its purpose.
You can see below that the activity ID does not uniquely identify any particular observation. Recall there are 163,574 samples but here we only see 16,167 IDs.
water_df$`Activity ID` %>% n_distinct()
## [1] 16167
Let’s look at the most frequent activity IDs.
act_id_samples <- water_df %>%
group_by(`Activity ID`) %>%
summarize(samples = n())
act_id_samples %>%
arrange(desc(samples)) %>%
head()
## # A tibble: 6 x 2
## `Activity ID` samples
## <chr> <int>
## 1 FY09-PS1009 13
## 2 FY09-PS1010 13
## 3 FY09-PS1011 13
## 4 FY09-PS1012 13
## 5 FY09-PS1013 13
## 6 FY09-PS1014 13
Well that didn’t tell us much. What about the least frequent?
tmp <- act_id_samples %>%
arrange(samples) %>%
head()
tmp
## # A tibble: 6 x 2
## `Activity ID` samples
## <chr> <int>
## 1 FY09-PS100 1
## 2 FY09-PS102 1
## 3 FY09-PS104 1
## 4 FY09-PS105 1
## 5 FY09-PS109 1
## 6 FY09-PS110 1
I have saved the result in a variable. Since there is only one of each, I can safely look at the all the observations for each without being flooded with data.
water_df %>%
# selects samples where the activity ID is in the table previously generated
filter(`Activity ID` %in% tmp$`Activity ID`) %>%
# these are not very useful so I excluded them
select(-`Org Name`, -`Station Horizontal Datum`)
## # A tibble: 6 x 18
## `Station ID` County HUC `Activity ID` `Activity Start`
## <chr> <chr> <dbl> <chr> <chr>
## 1 MD0003158-O… CHARL… 2.07e6 FY09-PS104 04/01/2009 12:0…
## 2 MD0003158-O… CHARL… 2.07e6 FY09-PS105 05/01/2009 12:0…
## 3 MD0003158-O… CHARL… 2.07e6 FY09-PS109 05/01/2009 12:0…
## 4 MD0003158-O… CHARL… 2.07e6 FY09-PS110 06/01/2009 12:0…
## 5 MD0003158-O… CHARL… 2.07e6 FY09-PS100 04/01/2009 12:0…
## 6 MD0003158-O… CHARL… 2.07e6 FY09-PS102 04/01/2009 12:0…
## # … with 13 more variables: `Activity Medium` <chr>, `Activity
## # Type` <chr>, `Activity Category-Rep Num` <chr>, `Characteristic
## # Name` <chr>, `Sample Fraction` <chr>, `Value Type` <chr>, `Result
## # Value Status` <chr>, `Result Value` <dbl>, Units <chr>, `Analytical
## # Proc ID` <chr>, `Location 1` <chr>, Counties <dbl>, the_date <dttm>
These samples look all exactly the same except for the station ID and activity ID. (Based on some attributes not shown.) We will see later that the “outfall” stations are again “troublemakers”.
I now believe the activity ID may be related to the station ID. Below I find the number of unique activity IDs each station uses.
water_df %>%
group_by(`Station ID`) %>%
summarize(num_unique = n_distinct(`Activity ID`))
## # A tibble: 272 x 2
## `Station ID` num_unique
## <chr> <int>
## 1 MD0000311 60
## 2 MD0000469 60
## 3 MD0001201 60
## 4 MD0001384 60
## 5 MD0001422 60
## 6 MD0001775 60
## 7 MD0003158-OUTFALL 007 60
## 8 MD0003158-OUTFALL 010 61
## 9 MD0003158-OUTFALL 021 60
## 10 MD0003158-OUTFALL 025 60
## # … with 262 more rows
I know where I have seen 60 before! Recall this is the total number of unique sampling dates. There are some anomalies such as 36 or 61 unique activity IDs. (From data not shown.)
Just to make sure, I find the number of activity IDs used by stations in 2009 below. I am expecting each station to use 12 activity IDs in a year.
water_df %>%
filter(year(the_date) == 2009) %>%
group_by(`Station ID`) %>%
summarize(num_unique = n_distinct(`Activity ID`))
## # A tibble: 270 x 2
## `Station ID` num_unique
## <chr> <int>
## 1 MD0000311 12
## 2 MD0000469 12
## 3 MD0001201 12
## 4 MD0001384 12
## 5 MD0001422 12
## 6 MD0001775 12
## 7 MD0003158-OUTFALL 007 12
## 8 MD0003158-OUTFALL 010 12
## 9 MD0003158-OUTFALL 021 12
## 10 MD0003158-OUTFALL 025 12
## # … with 260 more rows
I picked 2009 based on that there was a full year of data. It seemed “safe” compared to 2008 which does not have a full year of data (the data starts July 2008).
At this point, I am almost satisfied. Let’s list what we observed:
I actually had first arrived at a false conclusion. I do not state it and to correct myself requires some future knowledge:
I pick an arbitrary station, the first of the previous output, and analyze how many activity IDs it used.
water_df %>%
filter(year(the_date) == 2009, `Station ID` == "MD0000311") %>%
group_by(`Activity ID`) %>%
summarize(num_unique = n_distinct(`Characteristic Name`))
## # A tibble: 12 x 2
## `Activity ID` num_unique
## <chr> <int>
## 1 FY09-PS13 13
## 2 FY09-PS14 13
## 3 FY09-PS15 13
## 4 FY09-PS16 13
## 5 FY09-PS17 13
## 6 FY09-PS18 13
## 7 FY10-PS19 13
## 8 FY10-PS20 13
## 9 FY10-PS21 13
## 10 FY10-PS22 13
## 11 FY10-PS23 13
## 12 FY10-PS24 13
The above output shows that for each of the 12 IDs the station used, each were used 13 times (the same number as the number of characteristics).
I conclude that it is most probable that:
Each activity ID is used by a single station for a particular sampling day. For any sampling date, you can link all the characteristics collected by the same station (on that date) using the activity ID.
(I might drop this column since we’ll only be looking at the characteristic “orthophosphate”.)
Data Type: Date
This is the date of when the sample was taken. As mentioned before, samples are taken the first day of each month. There are 5 years of data and so there are 60 days where samples were taken.
I have converted this attribute to a date and will use the new attribute “the_date” in place of this.
Data Type: Nominal Categorical
The medium or substance we are testing for characterstics within. In this case we are only testing aspects of water so this is not very interesting. You can see below that the only medium is water. (This column will be dropped.)
water_df %>%
group_by(`Activity Medium`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Activity Medium` samples
## <chr> <int>
## 1 Water 163574
Data Type: Nominal Categorical
The type of activity being performed. In this case, we have data of routine sampling of water. As you can see below, the only type is “Sample-Routine”. (This column will be dropped.)
water_df %>%
group_by(`Activity Type`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Activity Type` samples
## <chr> <int>
## 1 Sample-Routine 163574
Data Type: Nominal Categorical
This column appears to be a duplicate of “Activity Type”. From the name, it appears that there should be numerical values.
However, as you can see below, we only have “Sample-Routine” as the sole value. Therefore, I have marked this as text data.
water_df %>%
group_by(`Activity Category-Rep Num`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Activity Category-Rep Num` samples
## <chr> <int>
## 1 Sample-Routine 163574
The metadata does not help other than saying this contains text. Without context, I cannot deduce what this attribute is for. (This column will be dropped.)
Data Type: Nominal Categorical
This is the characteristic or aspect of the water being measured. For example, there are samples with measurements for dissolved oxygen in water.
You can see below the number of samples for each characteristic name.
water_df %>%
group_by(`Characteristic Name`) %>%
summarize(samples = n())
## # A tibble: 13 x 2
## `Characteristic Name` samples
## <chr> <int>
## 1 Ammonia-nitrogen 14665
## 2 Biochemical oxygen demand, standard conditions 14629
## 3 Chemical oxygen demand 277
## 4 Dissolved oxygen (DO) 14775
## 5 Flow 16166
## 6 Inorganic nitrogen (nitrate and nitrite) 14704
## 7 Kjeldahl nitrogen 14704
## 8 Nitrogen 14704
## 9 Organic carbon 264
## 10 Organic Nitrogen 14665
## 11 Orthophosphate 14617
## 12 Phosphorus 14629
## 13 Total suspended solids 14775
Data Type: Nominal Categorical
Describes what kind of “fraction” the measurement is of. For example, some measurements such as dissolved oxygen only measure the fraction of dissolved oxygen in the water.
You can see below the number of samples for each kind of sample fraction.
water_df %>%
group_by(`Sample Fraction`) %>%
summarize(samples = n())
## # A tibble: 3 x 2
## `Sample Fraction` samples
## <chr> <int>
## 1 Dissolved 14775
## 2 Suspended 14775
## 3 Total 134024
Data Type: Nominal Categorical
As you can see below, the only value this attribute takes is “Actual”. My best guess is that there could of been values that are estimates.
It seems that I should take the name of this attribute at face value. This attribute simply describes the type of value that is in “Result Value”. (Type as in observed vs estimated, etc.)
(I will be dropping this column.)
water_df %>%
group_by(`Value Type`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Value Type` samples
## <chr> <int>
## 1 Actual 163574
Data Type: Nominal Categorical
As you can see below, the only value this attribute takes is “Final”. Again, it seems that this attribute should be taken at face value.
This attribute describes the status of the “Result Value” and indicates whether the “Result Value” is fully processed or needs processing.
(I will be dropping this column.)
water_df %>%
group_by(`Result Value Status`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Result Value Status` samples
## <chr> <int>
## 1 Final 163574
Data Type: Continuous Numerical
This attribute is the measured/observed value.
For example, if a sample is measuring the dissolved oxygen characteristic, the corresponding value is a quantative value of the amount of dissolved oxygen.
Data Type: Nominal Categorical
The units the “Result Value” is measured in.
Below you can see the two units used are “mg/l” and “mgd”.
water_df %>%
group_by(`Units`) %>%
summarize(samples = n())
## # A tibble: 2 x 2
## Units samples
## <chr> <int>
## 1 mg/l 147408
## 2 mgd 16166
Data Type: Nominal Categorical
My best guess is that this attribute is the “Analytical Process ID”. Without context, I cannot deduce what this attribute is for. (This column will be dropped.)
As you can see below, the only value is “MDPOINTSOURCE~MDE-DMR”.
water_df %>%
group_by(`Analytical Proc ID`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Analytical Proc ID` samples
## <chr> <int>
## 1 MDPOINTSOURCE~MDE-DMR 163574
Here is what I have gathered so far. Looking at the “~” delimeter, on the left is the name of this dataset “MD POINT SOURCE” and on the right is “MDE-DMR”.
MDE might stand for the Maryland Department of Environment:
https://mde.maryland.gov/Pages/index.aspx
DMR might stand for discharge monitoring report and I actually found another dataset (which I won’t be using because it’s out of scope of this “tutorial”):
https://echo.epa.gov/tools/data-downloads/icis-npdes-dmr-and-limit-data-set
Data Type: Geolocation
After reviewing a few latitude + longitude coordinates on maps, I am fairly certain that this is the location of the station itself (rather than where the sample of the body of water was taken).
Below is a few examples of the format. Parsing will be needed.
water_df %>%
group_by(`Location 1`) %>%
summarize(samples = n())
## # A tibble: 266 x 2
## `Location 1` samples
## <chr> <int>
## 1 "MARYLAND\n(37.966389, -76.018333)" 660
## 2 "MARYLAND\n(37.970278, -75.858333)" 660
## 3 "MARYLAND\n(37.989167, -76.041389)" 660
## 4 "MARYLAND\n(38.0625, -76.3325)" 630
## 5 "MARYLAND\n(38.083333, -75.583333)" 610
## 6 "MARYLAND\n(38.09, -75.42)" 660
## 7 "MARYLAND\n(38.094167, -75.799444)" 660
## 8 "MARYLAND\n(38.1008, -75.73)" 54
## 9 "MARYLAND\n(38.149722, -76.439167)" 660
## 10 "MARYLAND\n(38.158333, -75.838333)" 660
## # … with 256 more rows
Datatype: Discrete Numerical
This column was not included/listed in the metadata. The name is vague. Without context, I cannot deduce what this attribute is for.
Regardless, I will at least attempt a guess. Counties seems to have something to do with “County”. The table below counts the number of samples for each County and Counties pair.
water_df %>%
group_by(County, Counties) %>%
summarize(samples = n())
## # A tibble: 53 x 3
## # Groups: County [24]
## County Counties samples
## <chr> <dbl> <int>
## 1 ALLEGANY 417 7516
## 2 ALLEGANY 2806 90
## 3 ALLEGANY 2862 756
## 4 ANNE ARUNDEL 418 13228
## 5 ANNE ARUNDEL 1815 660
## 6 BALTIMORE 421 660
## 7 BALTIMORE 1763 5030
## 8 BALTIMORE 1857 1440
## 9 CALVERT 1764 1270
## 10 CALVERT 2656 660
## # … with 43 more rows
The table above has not given me much more insight. I do see that there are NA values for Counties. (From data not shown.) To the best of my knowledge, it is safe to drop this column.
Unlike the Water Point Source Sample data, the NOAA Precipitation data does have very good documentation as linked in Data Acquisition under The Data (“Hourly Precipitation Data Documentation” 2016).
Below is a preview of the data.
rain_df %>% head()
## # A tibble: 6 x 9
## STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE
## <chr> <chr> <dbl> <dbl> <dbl> <dttm>
## 1 COOP:1… BALTIMORE W… 47.5 39.2 -76.7 2008-01-01 01:00:00
## 2 COOP:1… BALTIMORE W… 47.5 39.2 -76.7 2008-01-03 11:00:00
## 3 COOP:1… BALTIMORE W… 47.5 39.2 -76.7 2008-01-03 12:00:00
## 4 COOP:1… BALTIMORE W… 47.5 39.2 -76.7 2008-01-05 19:00:00
## 5 COOP:1… BALTIMORE W… 47.5 39.2 -76.7 2008-01-05 20:00:00
## 6 COOP:1… BALTIMORE W… 47.5 39.2 -76.7 2008-01-05 23:00:00
## # … with 3 more variables: HPCP <dbl>, `Measurement Flag` <chr>, `Quality
## # Flag` <chr>
As you can see below, there are 13,533 samples.
rain_df %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 13533
One last important thing to note: According to the documentation, only hours when rain occurred are entered into this dataset (“Hourly Precipitation Data Documentation” 2016).
Data Type: Nominal Categorical
Cooperative identification number. This attribute uniquely identifies a station.
The table below shows the number of samples that each station has contributed to the dataset. There are 7 stations in total.
rain_df %>%
group_by(STATION) %>%
summarize(samples = n())
## # A tibble: 7 x 2
## STATION samples
## <chr> <int>
## 1 COOP:180015 800
## 2 COOP:180465 6540
## 3 COOP:180700 1532
## 4 COOP:183090 816
## 5 COOP:185718 1115
## 6 COOP:185934 1652
## 7 COOP:188065 1078
Data Type: Nominal Categorical
According to the documentation, this is: “A name identifying the station location” (“Hourly Precipitation Data Documentation” 2016).
The table below shows the number of samples associated with each station name in the dataset. There are 7 stations names in total and this matches up with the stations.
rain_df %>%
group_by(STATION_NAME) %>%
summarize(samples = n())
## # A tibble: 7 x 2
## STATION_NAME samples
## <chr> <int>
## 1 ABERDEEN PHILLIPS FIELD MD US 800
## 2 BALTIMORE WASHINGTON INTERNATIONAL AIRPORT MD US 6540
## 3 BELTSVILLE MD US 1532
## 4 FEDERALSBURG MD US 816
## 5 MARYLAND SCIENCE CENTER MD US 1115
## 6 MILLERS 4 NE MD US 1652
## 7 SAVAGE RIVER DAM MD US 1078
Data Type: Continuous Numerical
As the name of this attribute suggests, this is the elevation above sea level.
Data Type: Geolocation
The latitude geographic coordinate of the station.
Data Type: Geolocation
The longitude geographic coordinate of the station.
Data Type: Date
The date that the measurement ended.
We might run into a“logical” problem here since the last measurment possible in the day is “00:00” (“Hourly Precipitation Data Documentation” 2016). The issue may arise if we are trying to perform operations grouped by the entire day.
Data Type: Continuous Numerical
The amount of precipitation that the station measured. This value is cummulative. Since we asked for metric, this is in tenths of millimeters. Lastly, missing values are 99999 (“Hourly Precipitation Data Documentation” 2016).
Data Type: Nominal Categorical
Indicates potential problems or characteristics of the measurement. For example, “M” is for missing data and “g” is used for the very first entry. This is empty if “no flag is needed” (“Hourly Precipitation Data Documentation” 2016).
Data Type: Nominal Categorical
Indicates characteristics of the quality of measurment. For example, “R” indicates that one of the “NCDC’s quality control tests” failed (“Hourly Precipitation Data Documentation” 2016).
I will be detailing my data management. I will be discussing missing data.
My goals after processing the data is as follows:
The first thing I will do is filter so we’ll only have the orthophosphate characterstic. I’ll go ahead and also drop the columns I said I would definitely drop.
ortho_df <- water_df %>%
filter(`Characteristic Name` == "Orthophosphate") %>%
# note I am dropping this name because we already know it
select(-`Characteristic Name`,
-`Org Name`,
-`Activity Medium`,
-`Activity Type`,
-`Activity Category-Rep Num`,
-`Value Type`,
-`Result Value Status`,
-`Analytical Proc ID`,
-`Counties`,
# dropping this because we already have it parsed as "the_date"
-`Activity Start`)
Another column I suspect we can drop is sample fraction.
ortho_df %>%
group_by(`Sample Fraction`) %>%
summarize(samples = n())
## # A tibble: 1 x 2
## `Sample Fraction` samples
## <chr> <int>
## 1 Total 14617
As you can see above, this is no longer telling us any additional information. I drop the column below.
ortho_df <- ortho_df %>% select(-`Sample Fraction`)
The last thing I will do before missing data is parse the longitude and latitude. Note that I am still keeping “Location 1” as comparing doubles for exact match is not as easy as strings.
ortho_df <- ortho_df %>%
mutate(loc1 = str_sub(`Location 1`, start = 9)) %>%
separate(loc1, c("lat","lng"), sep = ",") %>%
mutate(lat=as.numeric(str_replace(lat, "\\(", ""))) %>%
mutate(lng=as.numeric(str_replace(lng, "\\)", "")))
ortho_df %>%
select(`Station ID`, lat, lng) %>%
head()
## # A tibble: 6 x 3
## `Station ID` lat lng
## <chr> <dbl> <dbl>
## 1 MD0021598 39.6 -78.8
## 2 MD0021610 39.4 -77.4
## 3 MD0021610 39.4 -77.4
## 4 MD0021610 39.4 -77.4
## 5 MD0021601 39.2 -76.6
## 6 MD0021610 39.4 -77.4
At this point we can start handling missing data.
I saved the process of handling missing data until the now because filtering and dropping unnecessary columns first can simplify this process. (You could be handling missing data that wouldn’t even be there afterwards!)
Of course I had to be careful with filtering. Imputing the mean after filtering can have very different results compared to the reverse. I only filtered for the “orthophosphate” characteristic. This does not affect with values like the mean value for that particular characteristic.
I analyze the attributes in two batches so the results fits in the page.
ortho_df %>%
summarize(sta_id_na = sum(is.na(`Station ID`)),
county_na = sum(is.na(County)),
huc_na =sum(is.na( HUC)),
datum = sum(is.na(`Station Horizontal Datum`)),
act_id_na = sum(is.na(`Activity ID`)))
## # A tibble: 1 x 5
## sta_id_na county_na huc_na datum act_id_na
## <int> <int> <int> <int> <int>
## 1 0 0 6 0 0
For this set of attributes, I see there are 6 samples with the HUC (hydrologic unit code) NA. I am not too worried as I do not plan to use this in my final predictor.
ortho_df %>%
summarize(date_na = sum(is.na(the_date)),
res_val_na = sum(is.na(`Result Value`)),
unit_na = sum(is.na(`Units`)),
lat_na = sum(is.na(lat)),
lng_na = sum(is.na(lng)))
## # A tibble: 1 x 5
## date_na res_val_na unit_na lat_na lng_na
## <int> <int> <int> <int> <int>
## 1 0 0 0 0 0
If you recall in the data description, I found there were NAs that were in character format. Below I investigate.
ortho_df %>% filter(`Station Horizontal Datum` == "UNKWN")
## # A tibble: 3 x 11
## `Station ID` County HUC `Station Horizo… `Activity ID` `Result Value`
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 MD0022993 UNSPE… NA UNKWN FY09-PS1399 2.52
## 2 MD0022993 UNSPE… NA UNKWN FY09-PS1400 2.52
## 3 MD0022993 UNSPE… NA UNKWN FY09-PS1401 2.52
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## # the_date <dttm>, lat <dbl>, lng <dbl>
ortho_df %>% filter(County == "UNSPECIFIED")
## # A tibble: 6 x 11
## `Station ID` County HUC `Station Horizo… `Activity ID` `Result Value`
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 MD0022993 UNSPE… NA UNKWN FY09-PS1399 2.52
## 2 MD0022993 UNSPE… NA UNKWN FY09-PS1400 2.52
## 3 MD0022993 UNSPE… NA UNKWN FY09-PS1401 2.52
## 4 MD0066184 UNSPE… NA NAD83 FY09-PS3085 0.7
## 5 MD0066184 UNSPE… NA NAD83 FY09-PS3086 0.8
## 6 MD0066184 UNSPE… NA NAD83 FY09-PS3087 1.1
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## # the_date <dttm>, lat <dbl>, lng <dbl>
It looks like this all goes back to the HUC. Recall there there were 6 samples with HUC as NA. Let’s see if we can drop these two stations altogether.
ortho_df %>% filter(`Station ID` == "MD0066184")
## # A tibble: 3 x 11
## `Station ID` County HUC `Station Horizo… `Activity ID` `Result Value`
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 MD0066184 UNSPE… NA NAD83 FY09-PS3085 0.7
## 2 MD0066184 UNSPE… NA NAD83 FY09-PS3086 0.8
## 3 MD0066184 UNSPE… NA NAD83 FY09-PS3087 1.1
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## # the_date <dttm>, lat <dbl>, lng <dbl>
ortho_df %>% filter(`Station ID` == "MD0022993")
## # A tibble: 3 x 11
## `Station ID` County HUC `Station Horizo… `Activity ID` `Result Value`
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 MD0022993 UNSPE… NA UNKWN FY09-PS1399 2.52
## 2 MD0022993 UNSPE… NA UNKWN FY09-PS1400 2.52
## 3 MD0022993 UNSPE… NA UNKWN FY09-PS1401 2.52
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## # the_date <dttm>, lat <dbl>, lng <dbl>
It seems that these stations each only have 3 data points. This will not be enough for our predictor anyway. (I plan to use data across the 5 years.) I remove these stations below.
ortho_df <- ortho_df %>%
filter(!(`Station ID` %in% c("MD0066184", "MD0022993")))
So far I like this data! From the results above, it seems there is almost no missing data (we will see that this is not true for the precipitation dataset).
The very first thing to look at for this dataset is the Measurment Flag and Quality Flag. According to NOAA’s documentation, some data is imprecise, missing, or “indicators”. Some rows are simply just telling us when the rain began! (No value associated.)
Below we can see that some rows have the quality flag “R”.
rain_df %>%
group_by(`Quality Flag`) %>%
summarize(samples = n())
## # A tibble: 2 x 2
## `Quality Flag` samples
## <chr> <int>
## 1 <NA> 13499
## 2 R 34
As mentioned in the data description, “R” indicates the measurment failed a NCDC quality control test.
As stated prior, my plan is to use the average of the accumulated rain only for hours when it rained for each month.
The noise from this data should be negligble because there are only 34 samples and we are computing the average over a month. Thus, I will disregard this flag.
Let’s begin constructing our transformed data. I will begin by dropping the quality flag column.
rain_tidy <- rain_df %>% select(-`Quality Flag`)
Now onto the measurement flag. We can see below that over a quarter of the data has some special flag.
rain_tidy %>%
group_by(`Measurement Flag`) %>%
summarize(samples = n())
## # A tibble: 9 x 2
## `Measurement Flag` samples
## <chr> <int>
## 1 <NA> 9502
## 2 [ 182
## 3 ] 184
## 4 { 96
## 5 } 96
## 6 a 7
## 7 A 7
## 8 g 315
## 9 T 3144
Let’s tackle the easiest first. As just mentioned, I only want to average the hours when it does rain. According to the documentation, “T” is for trace amounts. I will consider this as not raining. Below I remove these rows.
rain_tidy <- rain_tidy %>%
# notice that we want to keep NAs
# the added condition overrides the default filter behavior
filter(`Measurement Flag` != "T" | is.na(`Measurement Flag`))
The flag “g” is also fairly simple. The documentation says this is used on the first day when accumulations starts and the value is 0 (“Hourly Precipitation Data Documentation” 2016). Below I remove these rows.
rain_tidy <- rain_tidy %>%
# notice that we want to keep NAs
# the added condition overrides the default filter behavior
filter(`Measurement Flag` != "g" | is.na(`Measurement Flag`))
According to the documentation, “a” indicates when accumulation begins and “A” indicates when accumulation ends. There is no data associated when the flag is “a” and 99999 for “A” indicates “the accumulation continues into the next month” (“Hourly Precipitation Data Documentation” 2016).
Since the “a/A” flags only have 7 entries, I can just manually take a look. What I am concerned about is accumulation across months.
rain_tidy %>%
# the filter condition is true if measurement flag is "a" or "A"
filter(`Measurement Flag` %in% c("a", "A")) %>%
select(STATION, DATE, HPCP, `Measurement Flag`)
## # A tibble: 14 x 4
## STATION DATE HPCP `Measurement Flag`
## <chr> <dttm> <dbl> <chr>
## 1 COOP:183090 2010-08-05 01:00:00 25400. a
## 2 COOP:183090 2010-08-06 00:00:00 2.54 A
## 3 COOP:183090 2010-08-07 01:00:00 25400. a
## 4 COOP:183090 2010-08-08 00:00:00 15.2 A
## 5 COOP:183090 2010-08-08 01:00:00 25400. a
## 6 COOP:183090 2010-08-09 00:00:00 2.54 A
## 7 COOP:183090 2010-08-13 01:00:00 25400. a
## 8 COOP:183090 2010-08-14 00:00:00 12.7 A
## 9 COOP:183090 2010-08-18 01:00:00 25400. a
## 10 COOP:183090 2010-08-19 00:00:00 22.9 A
## 11 COOP:183090 2010-08-19 01:00:00 25400. a
## 12 COOP:183090 2010-08-20 00:00:00 15.2 A
## 13 COOP:183090 2010-08-23 01:00:00 25400. a
## 14 COOP:183090 2010-08-24 00:00:00 7.62 A
Oh no! If you look at the table above, the value is 25399.75 instead of 99999! Was the documentation wrong? Nope, it was just that when we asked for metric data, it divided everything. The original units were hundredth inches which is approximately 3.9 tenth millimeters (our current unit).
The good news is none of these carry into another month! Hurray! Let’s just drop the non-data “a” rows.
rain_tidy <- rain_tidy %>%
# notice that we want to keep NAs
# the added condition overrides the default filter behavior
filter(`Measurement Flag` != "a" | is.na(`Measurement Flag`))
The next is a little bit tricky. There is missing data encoded with “{}” and “[]”. This can be painful to process because these are intervals with two rows that indicate the start and stop.
What is really interesting is that not all the “[]” match! From a table above, there are 182 “[" but 184 “]”. My guess is that since we ordered a subset of the data, the extra “]” are ending the missing data from earlier (not in our subset).
Recall that I mentioned earlier that I want to compute some mean within a month.
Missing data is not problem if there are data points within that month. I can just impute the mean.
But if my end goal is to get the mean itself, then I can drop these rows! (Since the whole point of imputing the mean is to have the mean stay the same.)
The trickier problem comes if the data has a complete month missing. I want to do this “mean computation” for each month. If there is no data, then the mean itself is missing too!
In this case, I will look across years for that month and impute the mean from the mean of the other means! (I am assuming that a month, say January, will have similar weather over the years.)
Now we’ve reached a problem where we need to deal with dates. Recall that in the data description the “DATE” attribute is the end date. However, we run into the problem for edge cases where the measurement is taken at 12AM.
Ideally, we would want this rain data to count for the data before. (The rain happened and then the results were in once the clocked turned 12AM). Below we can see that this case indeed exists in our data.
rain_tidy %>%
filter(hour(DATE) == 0) %>%
select(STATION, DATE, HPCP, `Measurement Flag`)
## # A tibble: 601 x 4
## STATION DATE HPCP `Measurement Flag`
## <chr> <dttm> <dbl> <chr>
## 1 COOP:180465 2008-01-11 00:00:00 0.51 <NA>
## 2 COOP:180465 2008-01-18 00:00:00 0.25 <NA>
## 3 COOP:180465 2008-02-13 00:00:00 0.25 <NA>
## 4 COOP:180465 2008-03-05 00:00:00 0.51 <NA>
## 5 COOP:180465 2008-03-20 00:00:00 1.78 <NA>
## 6 COOP:180465 2008-04-04 00:00:00 0.51 <NA>
## 7 COOP:180465 2008-04-14 00:00:00 0.51 <NA>
## 8 COOP:180465 2008-04-27 00:00:00 1.52 <NA>
## 9 COOP:180465 2008-04-28 00:00:00 0.25 <NA>
## 10 COOP:180465 2008-05-09 00:00:00 7.87 <NA>
## # … with 591 more rows
There are 601 of these rows! Note that the POXIX format does not show the hour when the time is exactly on the start of the day (12AM).
The simplest solution is just to add a column for the starting date which I perform below.
rain_tidy <- rain_tidy %>% mutate(start_date = (DATE - hours(1)))
rain_tidy %>%
select(STATION, DATE, start_date, HPCP) %>%
head()
## # A tibble: 6 x 4
## STATION DATE start_date HPCP
## <chr> <dttm> <dttm> <dbl>
## 1 COOP:180465 2008-01-05 20:00:00 2008-01-05 19:00:00 0.25
## 2 COOP:180465 2008-01-06 02:00:00 2008-01-06 01:00:00 0.51
## 3 COOP:180465 2008-01-06 03:00:00 2008-01-06 02:00:00 0.76
## 4 COOP:180465 2008-01-06 04:00:00 2008-01-06 03:00:00 1.02
## 5 COOP:180465 2008-01-06 05:00:00 2008-01-06 04:00:00 1.78
## 6 COOP:180465 2008-01-06 06:00:00 2008-01-06 05:00:00 0.25
Now let’s get back to the missing data.
To see if there is actually some data that is not missing in a month, we can add a logical flag (0 or 1) for if that data is not missing. Then if we group and sum, the only possible value when all the data is missing is 0.
Note that when I say month, I am actually abbreviating for: a particular year’s month for a particular station.
rain_tidy <- rain_tidy %>%
# we consider the data not to be missing if the flag is NA
# or if the flag is not one of the missing flags
mutate(not_missing = (is.na(`Measurement Flag`) |
!(`Measurement Flag` %in% c("{", "}", "[", "]"))
)) %>%
group_by(STATION, year(start_date), month(start_date)) %>%
mutate(num_not_missing = sum(as.numeric(not_missing))) %>%
ungroup() %>%
# We only need to keep if num_not_missing is 0 since
# those are the samples we cannot compute the mean for.
# If num_not_missing != 0, we will implicitly handle this when
# we estimate the mean.
# (Recall that our "estimate" mean then would be missing.)
filter(not_missing | num_not_missing == 0)
The above code looks complex but all it does is handle dropping all the missing data we can safely drop.
Let’s see how much data were in the second case: where the entire month of data was missing for that station!
rain_tidy %>%
group_by(`Measurement Flag`) %>%
summarize(samples = n())
## # A tibble: 6 x 2
## `Measurement Flag` samples
## <chr> <int>
## 1 <NA> 9502
## 2 [ 136
## 3 ] 138
## 4 { 48
## 5 } 48
## 6 A 7
Wow, that’s quite a lot of stations with entire months of missing data!
Now I will compute the mean across all months for each station. I know there are 7 stations, 12 months in a year. So I expect 7 * 12 = 84 rows.
mean_df <- rain_tidy %>%
filter(not_missing) %>%
group_by(STATION, month(start_date)) %>%
summarize(avg = mean(HPCP)) %>%
rename(month = `month(start_date)`)
mean_df
## # A tibble: 84 x 3
## # Groups: STATION [7]
## STATION month avg
## <chr> <dbl> <dbl>
## 1 COOP:180015 1 2.83
## 2 COOP:180015 2 2.54
## 3 COOP:180015 3 2.78
## 4 COOP:180015 4 3.81
## 5 COOP:180015 5 3.51
## 6 COOP:180015 6 4.99
## 7 COOP:180015 7 5.20
## 8 COOP:180015 8 9.03
## 9 COOP:180015 9 4.39
## 10 COOP:180015 10 3.84
## # … with 74 more rows
Indeed we do get 84 rows. This is good. No station was super bad. We can deduce they all collected data for any month at least once in the span of 5 years.
We have everthing we need to start making the final summarized data. I am expecting 7 stations, 12 months in a year, and a span of 5 years.
7 * 12 * 6 = 504! So I expect 504 rows once we are done.
Let’s first construct a table with all the combinations. This will be the “skeleton” and tell use which values are implicitly missing
station_ids <- rain_tidy %>% distinct(STATION)
month_span <- seq(from = 1, to = 12)
year_span <- seq(from = 2008, to = 2013)
rain_final <- crossing(station_ids, month_span, year_span) %>%
rename(month = month_span, year = year_span)
rain_final
## # A tibble: 504 x 3
## STATION month year
## <chr> <int> <int>
## 1 COOP:180465 1 2008
## 2 COOP:180465 1 2009
## 3 COOP:180465 1 2010
## 4 COOP:180465 1 2011
## 5 COOP:180465 1 2012
## 6 COOP:180465 1 2013
## 7 COOP:180465 2 2008
## 8 COOP:180465 2 2009
## 9 COOP:180465 2 2010
## 10 COOP:180465 2 2011
## # … with 494 more rows
Now let’s compute what would have been our final table.
final_with_missing <- rain_tidy %>%
filter(not_missing) %>%
group_by(STATION, year(start_date), month(start_date)) %>%
summarize(avg = mean(HPCP)) %>%
rename(year = `year(start_date)`, month = `month(start_date)`)
final_with_missing
## # A tibble: 305 x 4
## # Groups: STATION, year [35]
## STATION year month avg
## <chr> <dbl> <dbl> <dbl>
## 1 COOP:180015 2008 5 3.32
## 2 COOP:180015 2008 6 3.81
## 3 COOP:180015 2008 7 5.41
## 4 COOP:180015 2008 8 3.67
## 5 COOP:180015 2008 9 5.17
## 6 COOP:180015 2008 10 2.72
## 7 COOP:180015 2008 11 3.29
## 8 COOP:180015 2008 12 2.94
## 9 COOP:180015 2009 1 2.83
## 10 COOP:180015 2009 2 2.54
## # … with 295 more rows
The code above computes the monthly mean (average) precipitation for the hours that rained for each station.
As you can see, there are only 305 rows! That is so much less than 504! We are missing much data. However, I am in too deep and we press forward.
There would be no way to tell which combinations were missing had we not generated the “skeleton”. Below I perform a left join to make the implicitly missing data, explicit.
rain_final <- rain_final %>%
left_join(final_with_missing,
by=c("STATION" = "STATION",
"year" = "year",
"month" = "month"))
rain_final
## # A tibble: 504 x 4
## STATION month year avg
## <chr> <dbl> <dbl> <dbl>
## 1 COOP:180465 1 2008 0.81
## 2 COOP:180465 1 2009 1.12
## 3 COOP:180465 1 2010 1.02
## 4 COOP:180465 1 2011 1.47
## 5 COOP:180465 1 2012 NA
## 6 COOP:180465 1 2013 1.51
## 7 COOP:180465 2 2008 1.44
## 8 COOP:180465 2 2009 0.505
## 9 COOP:180465 2 2010 1.44
## 10 COOP:180465 2 2011 1.34
## # … with 494 more rows
As you can see, station COOP:180465 had missing data on January 2012. Now all we have to do is impute the missing data.
First I will perform another left join so we can refer to the “global” month average. Remember we computed the mean_df before? Now we use it.
rain_final <- rain_final %>%
left_join(mean_df %>% rename(global_avg = avg),
by = c("STATION" = "STATION", "month" = "month"))
rain_final %>% head()
## # A tibble: 6 x 5
## STATION month year avg global_avg
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 COOP:180465 1 2008 0.81 1.19
## 2 COOP:180465 1 2009 1.12 1.19
## 3 COOP:180465 1 2010 1.02 1.19
## 4 COOP:180465 1 2011 1.47 1.19
## 5 COOP:180465 1 2012 NA 1.19
## 6 COOP:180465 1 2013 1.51 1.19
We’re in the endgame now! The final step is easy. We just need to replace if the avg is NA.
rain_final <- rain_final %>% mutate(avg = if_else(is.na(avg), global_avg, avg))
rain_final %>% head()
## # A tibble: 6 x 5
## STATION month year avg global_avg
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 COOP:180465 1 2008 0.81 1.19
## 2 COOP:180465 1 2009 1.12 1.19
## 3 COOP:180465 1 2010 1.02 1.19
## 4 COOP:180465 1 2011 1.47 1.19
## 5 COOP:180465 1 2012 1.19 1.19
## 6 COOP:180465 1 2013 1.51 1.19
As you can see above, we’ve essentially “borrowed” the value over from global_avg when avg was NA. For sanity check, let’s make sure we still have 504 rows in our final table!
rain_final %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 504
What I became immediately curious about was how these stations were distributed across Maryland. I provide a map below.
map <- ortho_df %>%
distinct(lat, lng, `Station ID`) %>%
leaflet() %>%
addTiles() %>%
addMarkers(lng = ~lng,
lat = ~lat,
popup = ~`Station ID`,
clusterOptions = markerClusterOptions())
map
This distribution looks good! It seems we are covering many of the waters in Maryland.
One thing I noticed that was peculiar was that just a few stations were at the same. It was very subtle but I’m glad I caught it! I investigate this below.
ortho_df %>%
distinct(`Location 1`, `Station ID`) %>%
group_by(`Location 1`) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 261 x 2
## `Location 1` count
## <chr> <int>
## 1 "MARYLAND\n(38.583333, -77.157778)" 4
## 2 "MARYLAND\n(39.359167, -76.370556)" 3
## 3 "MARYLAND\n(37.966389, -76.018333)" 1
## 4 "MARYLAND\n(37.970278, -75.858333)" 1
## 5 "MARYLAND\n(37.989167, -76.041389)" 1
## 6 "MARYLAND\n(38.0625, -76.3325)" 1
## 7 "MARYLAND\n(38.083333, -75.583333)" 1
## 8 "MARYLAND\n(38.09, -75.42)" 1
## 9 "MARYLAND\n(38.094167, -75.799444)" 1
## 10 "MARYLAND\n(38.149722, -76.439167)" 1
## # … with 251 more rows
As you can see above, there are 4 “different stations” occupying the same geolocation!
Below, we can see the Station IDs for these peculiar stations.
ortho_df %>%
filter(`Location 1` == "MARYLAND\n(38.583333, -77.157778)" |
`Location 1` == "MARYLAND\n(39.359167, -76.370556)") %>%
distinct(`Station ID`, `Location 1`)
## # A tibble: 7 x 2
## `Station ID` `Location 1`
## <chr> <chr>
## 1 MD0001201 "MARYLAND\n(39.359167, -76.370556)"
## 2 MD0003158-OUTFALL 025 "MARYLAND\n(38.583333, -77.157778)"
## 3 MD0003158-OUTFALL 010 "MARYLAND\n(38.583333, -77.157778)"
## 4 MD0021555-OUTFALL 001 "MARYLAND\n(39.359167, -76.370556)"
## 5 MD0003158-OUTFALL 085 "MARYLAND\n(38.583333, -77.157778)"
## 6 MD0003158-OUTFALL 021 "MARYLAND\n(38.583333, -77.157778)"
## 7 MD0021555-OUTFALL 002 "MARYLAND\n(39.359167, -76.370556)"
I looked up outfall. Basically it is water leaving a facility. My best guess is that these are for the multiple ways water can exit for some facility. I will worry about this when I have to (later).
I will do some unsupervised learning with k-means clustering. I would like to know what regions are similar in their orthophosphate levels.
Note when doing clustering it is very important to scale the data so that the scales for all variables are similar. Let’s do that first.
Below, I first get a yearly average for all the unique geolocations.
lat_lng_year_avg <- ortho_df %>%
mutate(the_year = year(the_date)) %>%
group_by(the_year, lat, lng) %>%
summarize(region_year_avg = mean(`Result Value`))
lat_lng_year_avg %>% head()
## # A tibble: 6 x 4
## # Groups: the_year, lat [6]
## the_year lat lng region_year_avg
## <dbl> <dbl> <dbl> <dbl>
## 1 2008 38.0 -76.0 2.52
## 2 2008 38.0 -75.9 0.34
## 3 2008 38.0 -76.0 2.52
## 4 2008 38.1 -76.3 2.52
## 5 2008 38.1 -75.6 0.102
## 6 2008 38.1 -75.4 0.0117
Now we can standardize within each year across the geolocations. Below, I perform this operation.
cluster_this <- lat_lng_year_avg %>%
group_by(the_year) %>%
mutate(year_mean = mean(region_year_avg),
year_stdev = sd(region_year_avg)) %>%
ungroup() %>%
mutate(std_val = (region_year_avg - year_mean) / year_stdev)
cluster_this %>% head()
## # A tibble: 6 x 7
## the_year lat lng region_year_avg year_mean year_stdev std_val
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 38.0 -76.0 2.52 1.83 1.74 0.396
## 2 2008 38.0 -75.9 0.34 1.83 1.74 -0.857
## 3 2008 38.0 -76.0 2.52 1.83 1.74 0.396
## 4 2008 38.1 -76.3 2.52 1.83 1.74 0.396
## 5 2008 38.1 -75.6 0.102 1.83 1.74 -0.994
## 6 2008 38.1 -75.4 0.0117 1.83 1.74 -1.05
Next, we scale the geolocation. In this case I compute a non-weighted mean. This is because there is no real significance of one location having more data.
In other words, we are simply trying to do a scaling operation to make the clustering work better.
Below I compute the non-weighted mean for location using distinct.
lat_stats <- cluster_this %>% distinct(lat) %>% summarize(avg = mean(lat), stdev = sd(lat))
lng_stats <- cluster_this %>% distinct(lng) %>% summarize(avg = mean(lng), stdev = sd(lng))
cluster_this <- cluster_this %>%
mutate(std_lat = (lat - lat_stats$avg) / lat_stats$stdev,
std_lng = (lng - lng_stats$avg) / lng_stats$stdev)
Let’s visualize the data. In the plot below, I plot based on geolocation and then add color to indicate the values.
cluster_this %>%
ggplot(mapping = aes(x = std_lat, y = std_lng, color = std_val)) +
geom_point(alpha = 0.3) +
facet_wrap(~the_year, ncol = 3) +
scale_color_gradient(low = "black", high = "red") +
labs(y = "Standardized Longitude", x = "Standardized Latitude",
color = "Std. Value",
title = "Standardized Values against Geolocation")
It’s faint, but there are some red circles. And woah! It seems we have data 10 standard deviations away from the mean!
cluster_this %>%
# here I plot the normal points
filter(std_val < 2) %>%
ggplot(mapping = aes(x = std_lat, y = std_lng)) +
geom_point(alpha = 0.3) +
# here I plot the exceptions
geom_point(data = cluster_this %>% filter(std_val >= 2),
mapping = aes(size = std_val),
color = "red") +
facet_wrap(~the_year, ncol = 3) +
labs(y = "Standardized Longitude", x = "Standardized Latitude",
color = "Std. Value",
title = "Standardized Values against Geolocation with Exceptions Highlighted")
What could possibly explain these exceptions? I wonder if there may be error in my computation or data.
Below, I compute the summary statistics.
summary(lat_lng_year_avg$region_year_avg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2017 1.3833 1.5925 2.5200 19.3227
That very large max value explains it! It does not seem like this is an error. From the shape of Maryland, and the shape of the graph, it seems like the measurment is near Washington, D.C. This could be something like the Potomac River. Maybe that’s why the levels are high?
Below I run the clustering algorithm for k parameters 1 through 10.
set.seed(1234)
cluster_res <- cluster_this %>%
# we only want to cluster on these three attributes
select(std_lat, std_lng, std_val) %>%
cluster::clusGap(stats::kmeans, K.max=10)
cluster_res
## Clustering Gap statistic ["clusGap"] from call:
## cluster::clusGap(x = ., FUNcluster = stats::kmeans, K.max = 10)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 1
## logW E.logW gap SE.sim
## [1,] 6.703889 7.522039 0.8181507 0.008482272
## [2,] 6.462340 7.133386 0.6710458 0.007592232
## [3,] 6.297964 6.997406 0.6994417 0.016910665
## [4,] 6.230786 6.893768 0.6629815 0.008491294
## [5,] 6.107800 6.796320 0.6885209 0.007084830
## [6,] 6.022372 6.706946 0.6845739 0.006845093
## [7,] 5.973502 6.648053 0.6745519 0.006307008
## [8,] 5.834772 6.599722 0.7649499 0.010465435
## [9,] 5.788874 6.563810 0.7749353 0.007963834
## [10,] 5.736389 6.534534 0.7981450 0.006800404
We can see an output table. Let’s visualize this. First we will need to tidy the results.
# get the table and convert it to a tidy tibble
result <- cluster_res$Tab %>%
as_tibble() %>%
# make the row's id into a attribute itself so we can graph it
rowid_to_column("k")
Below I graph the gap statistic for each k value.
ggplot(result, mapping = aes(x = k, y = gap)) +
geom_point() +
geom_line() +
geom_errorbar(mapping = aes(ymin = gap - SE.sim, ymax = gap + SE.sim)) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
labs(y = "Gap Statistic", x = "k Parameter",
title = "Gap Statistic vs k Parameter")
From this plot it would seem that the optimal number of clusters is 1. For more information on gap statistics and cluster see in the references (Tibshirani, Walther, and Hastie 2002). Let’s double check this.
cluster::maxSE(cluster_res$Tab[ ,"gap"], cluster_res$Tab[ ,"SE.sim"],
method = "Tibs2001SEmax",
SE.factor = 1)
## [1] 1
Wow, this clustering really failed. This result was against my intuition since I expected areas that are close to have similar water contents. Intuitively, these areas would be very likely to be clustered together since all three attributes (latitude, longitude, orthophosphate levels) would be close.
I would like to return back to the topic of the distribution of the values. There seemed to be some extreme exceptions.
lat_lng_year_avg %>%
ggplot() +
geom_boxplot(mapping = aes(x = the_year, y = region_year_avg, group = the_year)) +
labs(y = "Average Orthophosphate (mg/l)", x = "Year",
title = "Distribution of Average Orthophosphate by Year")
In the above plot we can see the distribution of orthophosphate levels for each year across location. We can see that there are some clear outliers.
Below I show a map for the stations that collect precipitation data.
rain_map <- rain_tidy %>%
distinct(LATITUDE, LONGITUDE) %>%
leaflet() %>%
addTiles() %>%
addMarkers(lng = ~LONGITUDE, lat = ~LATITUDE, clusterOptions = markerClusterOptions())
rain_map
Interestingly we get 10 unique geolocations yet there are only 7 stations. Below I output the unique geolocations.
rain_tidy %>%
distinct(LATITUDE, LONGITUDE)
## # A tibble: 10 x 2
## LATITUDE LONGITUDE
## <dbl> <dbl>
## 1 39.2 -76.7
## 2 39.0 -76.9
## 3 39.0 -76.9
## 4 39.3 -76.6
## 5 39.5 -76.2
## 6 39.5 -76.2
## 7 38.7 -75.8
## 8 39.7 -76.8
## 9 39.5 -79.1
## 10 39.5 -79.1
As you can see above, there appears to be some rounding errors. This will not be a problem as our final table uses only the station IDs.
The following graph shows the average hourly rain accumulation through time.
rain_final %>%
mutate(the_date =
# I kinda had to hack this date conversion
# To get a date object, I compute a period and then add to year 0.
as.Date(years(year) + months(month - 1),
origin = "0000-01-01")) %>%
ggplot() +
geom_line(mapping = aes(x = the_date, y = avg, color = STATION)) +
labs(y = "Average Hourly Precipitation (Tenth Millimeter)", x = "Date",
title = "Average Hourly Precipitation vs Time")
Note that “Average Hourly Precipitation” is for average for hours when it did rain.
It seems that the areas in Maryland get shifted levels of rainfall but they all follow a general trend. This graph is a little hard to see, let’s facet this.
rain_final %>%
mutate(the_date =
# I kinda had to hack this date conversion
# To get a date object, I compute a period and then add to year 0.
as.Date(years(year) + months(month - 1),
origin = "0000-01-01")) %>%
ggplot() +
geom_line(mapping = aes(x = the_date, y = avg)) +
facet_wrap(~STATION, ncol = 4) +
theme(axis.text.x = element_text(angle = 25, hjust = 0.75)) +
labs(y = "Average Hourly Precipitation (Tenth Millimeter)", x = "Date",
title = "Average Hourly Precipitation vs Time")
We can see there are roughly 6 spikes in rain for each stations This makes sense since this are 6 years of data and there are “rainy” months/seasons.
I will be showing you how to prepare the data.
Our features will be:
Our target is the orthophosphate level for the latest month.
Note that I want to establish a “causality” between rain and orthophosphate (in some sense). So I actually want the previous month of rain to be paired with each orthophosphate level. The orthophosphate is sampled at the beginning of each month, so we can almost think of this as the “result” of the previous month.
ortho_df <- ortho_df %>%
mutate(binding_date = (the_date - months(1))) %>%
# these make it easier to do a join as we'll see
mutate(binding_month = month(binding_date)) %>%
mutate(binding_year = year(binding_date))
Let’s get to splitting the data into our predictor data and outcome (target) data.
outcome_cutoff = ortho_df %>% summarize(max_date = max(the_date))
outcome_cutoff <- outcome_cutoff$max_date[1] - years(1)
outcome_cutoff
## [1] "2012-06-01 UTC"
Now we need to do some processing on rain final. I want each row to be for a particular month. So we will need to first combine the station ID and year. Then will use the spread function.
rain_join_this <- rain_final %>%
# unite STATION and year with a "." in between, column name is spread_key
unite("spread_key", STATION, year, sep = ".") %>%
select(-global_avg) %>%
spread(spread_key, avg)
rain_join_this
## # A tibble: 12 x 43
## month `COOP:180015.20… `COOP:180015.20… `COOP:180015.20…
## <dbl> <dbl> <dbl> <dbl>
## 1 1 2.83 2.83 2.83
## 2 2 2.54 2.54 2.54
## 3 3 2.78 2.79 2.78
## 4 4 3.81 2.54 3.95
## 5 5 3.32 3.75 3.51
## 6 6 3.81 3.39 4.99
## 7 7 5.41 4.87 5.20
## 8 8 3.67 10.7 9.03
## 9 9 5.17 3.64 4.39
## 10 10 2.72 4.21 3.84
## 11 11 3.29 2.96 3.16
## 12 12 2.94 5.01 3.67
## # … with 39 more variables: `COOP:180015.2011` <dbl>,
## # `COOP:180015.2012` <dbl>, `COOP:180015.2013` <dbl>,
## # `COOP:180465.2008` <dbl>, `COOP:180465.2009` <dbl>,
## # `COOP:180465.2010` <dbl>, `COOP:180465.2011` <dbl>,
## # `COOP:180465.2012` <dbl>, `COOP:180465.2013` <dbl>,
## # `COOP:180700.2008` <dbl>, `COOP:180700.2009` <dbl>,
## # `COOP:180700.2010` <dbl>, `COOP:180700.2011` <dbl>,
## # `COOP:180700.2012` <dbl>, `COOP:180700.2013` <dbl>,
## # `COOP:183090.2008` <dbl>, `COOP:183090.2009` <dbl>,
## # `COOP:183090.2010` <dbl>, `COOP:183090.2011` <dbl>,
## # `COOP:183090.2012` <dbl>, `COOP:183090.2013` <dbl>,
## # `COOP:185718.2008` <dbl>, `COOP:185718.2009` <dbl>,
## # `COOP:185718.2010` <dbl>, `COOP:185718.2011` <dbl>,
## # `COOP:185718.2012` <dbl>, `COOP:185718.2013` <dbl>,
## # `COOP:185934.2008` <dbl>, `COOP:185934.2009` <dbl>,
## # `COOP:185934.2010` <dbl>, `COOP:185934.2011` <dbl>,
## # `COOP:185934.2012` <dbl>, `COOP:185934.2013` <dbl>,
## # `COOP:188065.2008` <dbl>, `COOP:188065.2009` <dbl>,
## # `COOP:188065.2010` <dbl>, `COOP:188065.2011` <dbl>,
## # `COOP:188065.2012` <dbl>, `COOP:188065.2013` <dbl>
We will need to also get the orthophosphate data in the right format. If all goes well the below code should work.
predictor_df <- ortho_df %>%
filter(the_date < outcome_cutoff) %>%
select(`Station ID`, `Result Value`, binding_month, binding_year) %>%
unite("station_month", `Station ID`, binding_month, sep = ".")# %>%
# spread(binding_year, `Result Value`)
Oh no! As you can see I commented out code to show what we would have done. There was a unique key error. In other words, the station_month, binding year tuple is not unique! There are stations that double recorded!
predictor_df %>%
group_by(station_month, binding_year) %>%
summarize(samples = n()) %>%
arrange(desc(samples))
## # A tibble: 11,453 x 3
## # Groups: station_month [3,024]
## station_month binding_year samples
## <chr> <dbl> <int>
## 1 MD0000311.12 2009 2
## 2 MD0000311.1 2009 1
## 3 MD0000311.1 2010 1
## 4 MD0000311.1 2011 1
## 5 MD0000311.1 2012 1
## 6 MD0000311.10 2008 1
## 7 MD0000311.10 2009 1
## 8 MD0000311.10 2010 1
## 9 MD0000311.10 2011 1
## 10 MD0000311.11 2008 1
## # … with 11,443 more rows
In the above output we can see that “MD0000311” is the offending station.
Below I investigate further.
predictor_df %>% filter(station_month == "MD0000311.12", binding_year == "2009")
## # A tibble: 2 x 3
## station_month `Result Value` binding_year
## <chr> <dbl> <dbl>
## 1 MD0000311.12 0.08 2009
## 2 MD0000311.12 0.05 2009
Let’s look back to the original data and see if maybe these samples were taken on different days.
ortho_df %>% filter(`Station ID` == "MD0000311",
binding_month == 12,
binding_year == "2009") %>%
select(the_date)
## # A tibble: 2 x 1
## the_date
## <dttm>
## 1 2010-01-01 00:00:00
## 2 2010-01-01 00:00:00
So these are conflicting measurements on the same day! The simplest solution is just to do an average.
predictor_df <- predictor_df %>%
group_by(station_month, binding_year) %>%
summarize(avg_value = mean(`Result Value`)) %>%
spread(binding_year, avg_value)
predictor_df
## # A tibble: 3,024 x 6
## # Groups: station_month [3,024]
## station_month `2008` `2009` `2010` `2011` `2012`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MD0000311.1 NA 0.04 0.08 0.073 0.05
## 2 MD0000311.10 0.07 0.06 0.073 0.142 NA
## 3 MD0000311.11 0.03 0 0.09 0.142 NA
## 4 MD0000311.12 0.1 0.065 0.073 0.05 NA
## 5 MD0000311.2 NA 0.02 0 0.073 0.263
## 6 MD0000311.3 NA 0.06 0 0.073 0.064
## 7 MD0000311.4 NA 0.02 0 0.073 0.092
## 8 MD0000311.5 NA 0.02 0 0.073 NA
## 9 MD0000311.6 0.08 0.04 0.05 0.213 NA
## 10 MD0000311.7 0.12 0.16 0.07 0.071 NA
## # … with 3,014 more rows
We can see there is much data missing.
Below, I compute a list of how many attributes are missing for each.
attributes_missing <- rowSums(is.na(predictor_df))
attributes_missing %>% head()
## [1] 1 1 1 1 1 1
We can see below that there is no perfect row (without any data missing). This is unfortunate.
min(attributes_missing)
## [1] 1
There should be a total of about 15 thousand attributes total (3000 * 5). We can see below that over 20% of the data is missing.
sum(attributes_missing)
## [1] 3667
The future step could be to impute by using perhaps the mean.
Now let’s join the two datasets! (The outcome data still needs to be made though.)
joined_data <- predictor_df %>%
separate("station_month", c("station_id", "month"), sep = "\\.") %>%
mutate(month = as.numeric(month)) %>%
left_join(rain_join_this, by = "month")
joined_data
## # A tibble: 3,024 x 49
## station_id month `2008` `2009` `2010` `2011` `2012` `COOP:180015.20…
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MD0000311 1 NA 0.04 0.08 0.073 0.05 2.83
## 2 MD0000311 10 0.07 0.06 0.073 0.142 NA 2.72
## 3 MD0000311 11 0.03 0 0.09 0.142 NA 3.29
## 4 MD0000311 12 0.1 0.065 0.073 0.05 NA 2.94
## 5 MD0000311 2 NA 0.02 0 0.073 0.263 2.54
## 6 MD0000311 3 NA 0.06 0 0.073 0.064 2.78
## 7 MD0000311 4 NA 0.02 0 0.073 0.092 3.81
## 8 MD0000311 5 NA 0.02 0 0.073 NA 3.32
## 9 MD0000311 6 0.08 0.04 0.05 0.213 NA 3.81
## 10 MD0000311 7 0.12 0.16 0.07 0.071 NA 5.41
## # … with 3,014 more rows, and 41 more variables: `COOP:180015.2009` <dbl>,
## # `COOP:180015.2010` <dbl>, `COOP:180015.2011` <dbl>,
## # `COOP:180015.2012` <dbl>, `COOP:180015.2013` <dbl>,
## # `COOP:180465.2008` <dbl>, `COOP:180465.2009` <dbl>,
## # `COOP:180465.2010` <dbl>, `COOP:180465.2011` <dbl>,
## # `COOP:180465.2012` <dbl>, `COOP:180465.2013` <dbl>,
## # `COOP:180700.2008` <dbl>, `COOP:180700.2009` <dbl>,
## # `COOP:180700.2010` <dbl>, `COOP:180700.2011` <dbl>,
## # `COOP:180700.2012` <dbl>, `COOP:180700.2013` <dbl>,
## # `COOP:183090.2008` <dbl>, `COOP:183090.2009` <dbl>,
## # `COOP:183090.2010` <dbl>, `COOP:183090.2011` <dbl>,
## # `COOP:183090.2012` <dbl>, `COOP:183090.2013` <dbl>,
## # `COOP:185718.2008` <dbl>, `COOP:185718.2009` <dbl>,
## # `COOP:185718.2010` <dbl>, `COOP:185718.2011` <dbl>,
## # `COOP:185718.2012` <dbl>, `COOP:185718.2013` <dbl>,
## # `COOP:185934.2008` <dbl>, `COOP:185934.2009` <dbl>,
## # `COOP:185934.2010` <dbl>, `COOP:185934.2011` <dbl>,
## # `COOP:185934.2012` <dbl>, `COOP:185934.2013` <dbl>,
## # `COOP:188065.2008` <dbl>, `COOP:188065.2009` <dbl>,
## # `COOP:188065.2010` <dbl>, `COOP:188065.2011` <dbl>,
## # `COOP:188065.2012` <dbl>, `COOP:188065.2013` <dbl>
In conclusion, the data is a large part of the effort. I underestimated the time it would take to wrangle the data.
One of the biggest bottlenecks is missing data. So unfortunately, I did not get to answering the question of can we predict the orthophosphate level.
I hope that this “tutorial” helps someone and perhaps someone will do supervised machine learning with the processed data.
Chislock, Michael F., E. Doster, R. A. Zitomer, and A. E. Wilson. 2013. “Eutrophication: Causes, Consequences, and Controls in Aquatic Ecosystems.” Nature Education Knowledge. https://www.nature.com/scitable/knowledge/library/eutrophication-causes-consequences-and-controls-in-aquatic-102364466.
“Hourly Precipitation Data Documentation.” 2016. National Oceanic and Atmospheric Administration. https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/PRECIP_HLY_documentation.pdf.
“Hydrologic Unit Maps.” 2018. USGS Water Resources of the United States. U.S. Department of Interior | U.S. Geological Survey. https://water.usgs.gov/GIS/huc.html.
Shumway, Sandra E. 1990. “A Review of the Effects of Algal Blooms on Shellfish and Aquaculture.” Journal of the World Aquaculture Society 21 (2): 65–104. doi:10.1111/j.1749-7345.1990.tb00529.x.
Soltis-Muth, Cheryl. n.d. “TP - It Stands for Total Phosphate.” http://www.ohiowea.org/docs/OWEA_Phosphorus_CSM.pdf.
“The Elements of Geodesy: The Horizontal Datum.” 2004. NOAA National Ocean Service Education: Geodesy. US Department of Commerce | National Oceanic and Atmospheric Administration. https://oceanservice.noaa.gov/education/kits/geodesy/geo05_horizdatum.html.
Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2002. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411–23. doi:10.1111/1467-9868.00293.
Zhang, Feng, Jiyoung Lee, Song Liang, and Ck Shum. 2015. “Cyanobacteria Blooms and Non-Alcoholic Liver Disease: Evidence from a County Level Ecological Study in the United States.” Environmental Health 14 (1). doi:10.1186/s12940-015-0026-7.