Introduction

The Question

Can we use precipitation data to predict the orthophosphate levels in bodies of water?

Motivation

Orthophosphate

You are probably wondering:

  • What is orthophosphate? (Sounds like a mouthful!)
  • Why should I care about the orthophosphate levels in water?

Orthophosphate is known as reactive phosphate and is a main ingredient in fertilizer. Basically, this stuff makes living stuff grow; it’s a nutrient that is “readily utilized by biota”. When it rains, orthophosphates can get into bodies of water through run-off (Soltis-Muth, n.d.).

In biology terms, an excess of nutrients can lead to eutrophication. You should care because this phenomenon is harmful to both the environment and humans.

Eutrophication

With great nutrients comes a whole lot of algae. (And responsibility, as I’ll explain!)

Wow, another big word in the title! Here is a brief simplified overview of what eutrophication is.

  1. Nutrients cause a lot of plants and algae to grow, which then die.
  2. Bacteria decomposing the dead stuff use up all the oxygen, which then suffocates the animals in the water.

There are many other problems such as an increase in pH that is associated with algae blooms (Chislock et al. 2013).

Let’s backtrack. Nutrients in run-off that came from fertilizers played a part in this catastrophe. So be careful, mindful, and responsible when you use fertilizers!

Still not convinced to care? I concede that I have not yet discussed how this affects humans.

  • If you eat any kind of shellfish, you should definitely care. Some algae have toxins that make it into shellfish! (Shumway 1990)
  • While cyanobacteria are not algae, they can be involved in a similar process: cyanobacteria blooms. These have been shown to cause an increase in “non-alcoholic liver disease” in humans! (Zhang et al. 2015)

What can we do?

  • We have concluded that algae blooms can be bad for the environment and humans.
  • We know there is a link between orthophosphate and algae growth.
  • We know orthophosphate is found in fertilizers and run-off happens during rain.

My motivation for investigating The Question is this: if we can predict orthophosphate levels using weather, perhaps we can limit the damaging effects of these blooms.

Data Acquisition

Finding Data

Getting data is no easy task for data scientists.

Oftentimes if one has data, one has to clean or scrape that data. Other times one does not have data and has to spend hours finding suitable data. (Sometimes it is both.)

For this task, I spent many hours (possibly 6 or more) before I found data that was both of good quality and related to a good question!

At first I looked across different dataset repositories such as https://www.data.gov. I did not have any luck until I came across AWS open data: https://registry.opendata.aws.

Some candidate datasets I considered using were:

  • eBird data
    • https://registry.opendata.aws/ebirdst/
    • https://cornelllabofornithology.github.io/ebirdst/
    • This is some data on birds.
    • This would be a good topic since birds play a very important role in the environment; they contribute to pollination, for example.
    • I found that this data was ready to use as a package was available in R already.
    • Some issues I ran into were that the data seemed mostly to be raster/image data. I was unsure how to manipulate these.
    • Also, the numerical data seemed to be related to model accuracy for some pre-defined model. Since I would like to analyze data itself, this would not be very helpful.
  • Finnish (and HIRLAM) data
  • Multimedia Commons (YFCC100M)
    • https://registry.opendata.aws/multimedia-commons/
    • https://multimediacommons.wordpress.com/yfcc100m-core-dataset/
    • The Multimedia Commons is an add-on dataset to the YFCC dataset. It provides pre-extracted features for images and audio.
    • We did not learn much about image processing, so this would be a great dataset to use since the features come pre-extracted. (They’re actually still working on finishing the dataset.)
    • However, the original YFCC dataset required some kind of request for it. Since I did not want to wait, I did not choose this dataset.
    • Another issue was that many extracted features were in some LIRE format. I found it hard to understand the documentation.
    • And for features like SIFT not in this format, I struggled to find R libraries for image related machine learning. I would also have had to write a binary file parser in R.

Although AWS open data is a great place for data, the sheer amount became overwhelming; I spent much of my time looking there. Unfortunately, I did not find any data to my liking.

These were just some of the datasets I considered and found unsuitable.

The Data

I decided to search a bit more locally. Perhaps the county or state government would have some good data. Indeed I found good data on Maryland’s open data portal: https://opendata.maryland.gov.

I ended up settling on the Water Point Source Sample data:

https://opendata.maryland.gov/Energy-and-Environment/Water-Point-Source-Sample-Results/eqs6-savc

Now all I needed was precipitation data. Since I now knew exactly what I was looking for, I found it pretty quickly. I will also be using data from NOAA’s Climate Data Online:

https://www.ncdc.noaa.gov/cdo-web/

https://www.ncdc.noaa.gov/cdo-web/datasets

The documentation can be found here:

https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/PRECIP_HLY_documentation.pdf

I have included both datasets in my GitHub repository. If you wish to go through the process yourself, here are the details.

This is a free dataset but we still have to “order” it. The steps to obtain this dataset are as follows:

  1. Go to the search tool: https://www.ncdc.noaa.gov/cdo-web/search. Enter the following into the search form:
    • Precipitation Hourly
    • 2008-01-01 to 2013-12-31
    • States
    • Maryland
  2. Submitting will take you to a map. Find the “add to cart” button on the left sidebar.
  3. Hover over the cart on the right upper hand corner and click “go to cart”.
  4. Check that everything is in order and click continue until prompted for station details/flags. Since more data cannot hurt, check all the options (station name, geographic location, and data flags). Make sure to switch the units from standard to metric. (This actually requires some knowledge ahead of time: our primary data is in metric.)
  5. You definitely should check Precipitation for data types. Click continue.
  6. Again, check everything is in order, enter your email address and submit.
  7. You will receive a link to download the data.

Your parameters should look something like the following.

Working With Data

Now that we have found the data, it is a good idea to load both datasets. Having them loaded will help in describing the data when the metadata is not clear.

Required Packages

First, I will define what packages you’ll need. It is helpful to know this at the start.

# tibble related libraries
library(tidyverse)
library(tidyr)
# date processing
library(lubridate)
# string manipulation
library(stringr)
# map capabilities
library(leaflet)

Loading Data

Water Point Source Sample

# do not worry about this, explanation is in the data description
m_types <- cols(
  `Org Name` = col_character(),
  `Station ID` = col_character(),
  County = col_character(),
  HUC = col_double(),
  `Station Horizontal Datum` = col_character(),
  `Activity ID` = col_character(),
  `Activity Start` = col_character(),
  `Activity Medium` = col_character(),
  `Activity Type` = col_character(),
  `Activity Category-Rep Num` = col_character(),
  `Characteristic Name` = col_character(),
  `Sample Fraction` = col_character(),
  `Value Type` = col_character(),
  `Result Value Status` = col_character(),
  `Result Value` = col_double(),
  Units = col_character(),
  `Analytical Proc ID` = col_character(),
  `Location 1` = col_character(),
  Counties = col_double()
)

water_df <- read_csv("Water_Point_Source_Sample_Results.csv",
                     col_types = m_types) %>%
  # parse the strings into date objects
  # format is month day year hours:minutes:seconds
  mutate(the_date = mdy_hms(`Activity Start`))

NOAA Precipitation

# do not worry about this, explanation is in the data description
m_types <- cols(
  STATION = col_character(),
  STATION_NAME = col_character(),
  ELEVATION = col_double(),
  LATITUDE = col_double(),
  LONGITUDE = col_double(),
  # here the data is in this format: year-month-day hours:minutes:seconds
  DATE = col_datetime(format = "%Y%m%d %H:%M"),
  HPCP = col_double(),
  `Measurement Flag` = col_character(),
  `Quality Flag` = col_character()
)

rain_df <- read_csv("precipitation.csv", col_types = m_types)

Description of Dataset

I am using two datasets. The primary data (with our “target” orthophosphate attribute) is the Water Point Source Sample data and the auxiliary data is the NOAA Precipitation data.

Water Point Source Sample

The biggest issue with this dataset is that the metadata only includes a suggested data type for each column (which may or may not be correct). The only two types listed are “text” and “number”.

Without any description, I have to either infer or do analysis to gain more context. We will see below that there is much pre-analysis even before exploratory data analysis.
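
As an aside, here is a minimal sketch of how one could infer the column types in the first place: read a small sample of the file, let readr guess, and inspect the guessed specification. Something like this is roughly how I arrived at the m_types specification used above.

# read only the first 100 rows and let readr guess each column's type
peek <- read_csv("Water_Point_Source_Sample_Results.csv", n_max = 100)
# print the guessed column specification
spec(peek)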

water_df %>% head()
## # A tibble: 6 x 20
##   `Org Name` `Station ID` County    HUC `Station Horizo… `Activity ID`
##   <chr>      <chr>        <chr>   <dbl> <chr>            <chr>        
## 1 Maryland … MD0021598    ALLEG… 2.07e6 NAD83            FY12-PS903   
## 2 Maryland … MD0021598    ALLEG… 2.07e6 NAD83            FY12-PS903   
## 3 Maryland … MD0021598    ALLEG… 2.07e6 NAD83            FY12-PS903   
## 4 Maryland … MD0021598    ALLEG… 2.07e6 NAD83            FY12-PS903   
## 5 Maryland … MD0021598    ALLEG… 2.07e6 NAD83            FY12-PS902   
## 6 Maryland … MD0021598    ALLEG… 2.07e6 NAD83            FY12-PS902   
## # … with 14 more variables: `Activity Start` <chr>, `Activity
## #   Medium` <chr>, `Activity Type` <chr>, `Activity Category-Rep
## #   Num` <chr>, `Characteristic Name` <chr>, `Sample Fraction` <chr>,
## #   `Value Type` <chr>, `Result Value Status` <chr>, `Result Value` <dbl>,
## #   Units <chr>, `Analytical Proc ID` <chr>, `Location 1` <chr>,
## #   Counties <dbl>, the_date <dttm>

As you can see below, there are 163,574 samples.

water_df %>% count()
## # A tibble: 1 x 1
##        n
##    <int>
## 1 163574

Below you can see that all samples are taken on the first day of each month over 5 years (there are 60 distinct dates, and 60 / 12 = 5).

water_df %>%
  group_by(the_date) %>%
  summarize(samples = n())
## # A tibble: 60 x 2
##    the_date            samples
##    <dttm>                <int>
##  1 2008-07-01 00:00:00    2701
##  2 2008-08-01 00:00:00    2667
##  3 2008-09-01 00:00:00    2707
##  4 2008-10-01 00:00:00    2678
##  5 2008-11-01 00:00:00    2686
##  6 2008-12-01 00:00:00    2720
##  7 2009-01-01 00:00:00    2742
##  8 2009-02-01 00:00:00    2746
##  9 2009-03-01 00:00:00    2778
## 10 2009-04-01 00:00:00    2756
## # … with 50 more rows
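
As a quick sanity check, we can also ask directly whether every sample date falls on the first of a month. This is just a one-liner sketch; if the claim holds, the result should be a single TRUE.

# check that every sample's date is the first day of a month
water_df %>%
  summarize(all_first_of_month = all(day(the_date) == 1))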

Org Name

Data Type: Text

The organization name. This is not particularly interesting since all the samples are under “Maryland Point data”. I will definitely be dropping this column.

(This attribute name is somewhat of a misnomer since the value isn’t really an organization.)

water_df %>%
  group_by(`Org Name`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Org Name`          samples
##   <chr>                 <int>
## 1 Maryland Point data  163574

Station ID

Data Type: Nominal Categorical

The unique identifier of the station. There are a total of 272 stations.

water_df %>%
  group_by(`Station ID`) %>%
  summarize(samples = n()) %>%
  arrange(desc(samples))
## # A tibble: 272 x 2
##    `Station ID` samples
##    <chr>          <int>
##  1 MD0000311        780
##  2 MD0021687        780
##  3 MD0001775        768
##  4 MD0001422        756
##  5 MD0003221        660
##  6 MD0020001        660
##  7 MD0020010        660
##  8 MD0020052        660
##  9 MD0020095        660
## 10 MD0020168        660
## # … with 262 more rows

County

Data Type: Nominal Categorical

The county in Maryland that the station is located in.

You can see below the number of samples in the data for each county. There are some listed as unspecified that might need to be filtered out.

water_df %>%
  group_by(`County`) %>%
  summarize(samples = n()) %>%
  arrange(desc(samples))
## # A tibble: 24 x 2
##    County          samples
##    <chr>             <int>
##  1 FREDERICK         19994
##  2 CECIL             15488
##  3 ANNE ARUNDEL      13888
##  4 WASHINGTON        12820
##  5 CARROLL            9362
##  6 CHARLES            9209
##  7 GARRETT            9100
##  8 ALLEGANY           8362
##  9 BALTIMORE          7130
## 10 PRINCE GEORGE'S    6080
## # … with 14 more rows

HUC

Data Type: Nominal Categorical

(see “Hydrologic Unit Maps” 2018)

  • Hydrologic Unit Code
  • You can think of this as a zipcode but for bodies of water.
  • This code is the concatenation of multiple codes. For example, the first two digits signify the “major geographical region” the water body is in. (For this dataset, a zero may need to be padded at the beginning for this to be true; see the sketch below.)
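
Here is a minimal sketch of what I mean by the padding (purely illustrative; I will not use the region code in the analysis). It pads the HUC back out to 8 digits and pulls off the first two digits, which for Maryland should mostly be the Mid-Atlantic region code.

# pad the numeric HUC to 8 digits and extract the first two digits
# (the "major geographical region"); this is just an illustration
water_df %>%
  mutate(huc_str = str_pad(as.character(HUC), width = 8, pad = "0"),
         huc_region = str_sub(huc_str, 1, 2)) %>%
  count(huc_region)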

Station Horizontal Datum

Data Type: Nominal Categorical

(see “The Elements of Geodesy: The Horizontal Datum” 2004)

  • This is the geodetic datum being used; a datum is a reference system that accounts for the shape of the Earth.
  • The system is a “collection of specific points on the Earth”.
  • NAD27 is the outdated version and NAD83 is the newer version.

As you can see below, there are no stations that explicitly use NAD27 but there are some unknown entries that might need to be filtered out.

water_df %>%
  group_by(`Station Horizontal Datum`) %>%
  summarize(samples = n())
## # A tibble: 2 x 2
##   `Station Horizontal Datum` samples
##   <chr>                        <int>
## 1 NAD83                       163496
## 2 UNKWN                           78

Activity ID

Data Type: Nominal Categorical

This attribute was not documented well, so I will first do some preliminary analysis to determine its purpose.

You can see below that the activity ID does not uniquely identify any particular observation. Recall there are 163,574 samples but here we only see 16,167 IDs.

water_df$`Activity ID` %>% n_distinct()
## [1] 16167

Let’s look at the most frequent activity IDs.

act_id_samples <- water_df %>%
  group_by(`Activity ID`) %>%
  summarize(samples = n())

act_id_samples %>%
  arrange(desc(samples)) %>%
  head()
## # A tibble: 6 x 2
##   `Activity ID` samples
##   <chr>           <int>
## 1 FY09-PS1009        13
## 2 FY09-PS1010        13
## 3 FY09-PS1011        13
## 4 FY09-PS1012        13
## 5 FY09-PS1013        13
## 6 FY09-PS1014        13

Well that didn’t tell us much. What about the least frequent?

tmp <- act_id_samples %>%
  arrange(samples) %>%
  head()

tmp
## # A tibble: 6 x 2
##   `Activity ID` samples
##   <chr>           <int>
## 1 FY09-PS100          1
## 2 FY09-PS102          1
## 3 FY09-PS104          1
## 4 FY09-PS105          1
## 5 FY09-PS109          1
## 6 FY09-PS110          1

I have saved the result in a variable. Since there is only one of each, I can safely look at all the observations for each ID without being flooded with data.

water_df %>%
  # selects samples where the activity ID is in the table previously generated
  filter(`Activity ID` %in% tmp$`Activity ID`) %>%
  # these are not very useful so I excluded them
  select(-`Org Name`, -`Station Horizontal Datum`)
## # A tibble: 6 x 18
##   `Station ID` County    HUC `Activity ID` `Activity Start`
##   <chr>        <chr>   <dbl> <chr>         <chr>           
## 1 MD0003158-O… CHARL… 2.07e6 FY09-PS104    04/01/2009 12:0…
## 2 MD0003158-O… CHARL… 2.07e6 FY09-PS105    05/01/2009 12:0…
## 3 MD0003158-O… CHARL… 2.07e6 FY09-PS109    05/01/2009 12:0…
## 4 MD0003158-O… CHARL… 2.07e6 FY09-PS110    06/01/2009 12:0…
## 5 MD0003158-O… CHARL… 2.07e6 FY09-PS100    04/01/2009 12:0…
## 6 MD0003158-O… CHARL… 2.07e6 FY09-PS102    04/01/2009 12:0…
## # … with 13 more variables: `Activity Medium` <chr>, `Activity
## #   Type` <chr>, `Activity Category-Rep Num` <chr>, `Characteristic
## #   Name` <chr>, `Sample Fraction` <chr>, `Value Type` <chr>, `Result
## #   Value Status` <chr>, `Result Value` <dbl>, Units <chr>, `Analytical
## #   Proc ID` <chr>, `Location 1` <chr>, Counties <dbl>, the_date <dttm>

These samples all look exactly the same except for the station ID and activity ID (based on some attributes not shown). We will see later that the “outfall” stations are again “troublemakers”.

I now believe the activity ID may be related to the station ID. Below I find the number of unique activity IDs each station uses.

water_df %>%
  group_by(`Station ID`) %>%
  summarize(num_unique = n_distinct(`Activity ID`))
## # A tibble: 272 x 2
##    `Station ID`          num_unique
##    <chr>                      <int>
##  1 MD0000311                     60
##  2 MD0000469                     60
##  3 MD0001201                     60
##  4 MD0001384                     60
##  5 MD0001422                     60
##  6 MD0001775                     60
##  7 MD0003158-OUTFALL 007         60
##  8 MD0003158-OUTFALL 010         61
##  9 MD0003158-OUTFALL 021         60
## 10 MD0003158-OUTFALL 025         60
## # … with 262 more rows

I know where I have seen 60 before! Recall this is the total number of unique sampling dates. There are some anomalies such as 36 or 61 unique activity IDs. (From data not shown.)

Just to make sure, I find the number of activity IDs used by stations in 2009 below. I am expecting each station to use 12 activity IDs in a year.

water_df %>%
  filter(year(the_date) == 2009) %>%
  group_by(`Station ID`) %>%
  summarize(num_unique = n_distinct(`Activity ID`))
## # A tibble: 270 x 2
##    `Station ID`          num_unique
##    <chr>                      <int>
##  1 MD0000311                     12
##  2 MD0000469                     12
##  3 MD0001201                     12
##  4 MD0001384                     12
##  5 MD0001422                     12
##  6 MD0001775                     12
##  7 MD0003158-OUTFALL 007         12
##  8 MD0003158-OUTFALL 010         12
##  9 MD0003158-OUTFALL 021         12
## 10 MD0003158-OUTFALL 025         12
## # … with 260 more rows

I picked 2009 because it has a full year of data. It seemed “safe” compared to 2008, which does not have a full year of data (the data starts July 2008).

At this point, I am almost satisfied. Let’s list what we observed:

  • Each activity ID is used anywhere from 1 to 13 times.
  • Different activity IDs could be the same “activity” with the same results.
  • Each station used roughly 60 activity IDs in the span of 5 years.
  • Each station used roughly 12 activity IDs in 2009.

I actually arrived at a false conclusion at first. I will not state it here; correcting it requires a piece of knowledge from later in the analysis:

  • There are 13 characteristic names (types of measurements).

I pick an arbitrary station (the first in the previous output) and check how many distinct characteristic names each of its activity IDs covers.

water_df %>%
  filter(year(the_date) == 2009, `Station ID` == "MD0000311") %>%
  group_by(`Activity ID`) %>%
  summarize(num_unique = n_distinct(`Characteristic Name`))
## # A tibble: 12 x 2
##    `Activity ID` num_unique
##    <chr>              <int>
##  1 FY09-PS13             13
##  2 FY09-PS14             13
##  3 FY09-PS15             13
##  4 FY09-PS16             13
##  5 FY09-PS17             13
##  6 FY09-PS18             13
##  7 FY10-PS19             13
##  8 FY10-PS20             13
##  9 FY10-PS21             13
## 10 FY10-PS22             13
## 11 FY10-PS23             13
## 12 FY10-PS24             13

The above output shows that each of the 12 activity IDs the station used covers 13 distinct characteristic names (the same as the total number of characteristics).

I conclude that it is most probable that:

Each activity ID is used by a single station for a particular sampling day. For any sampling date, you can link all the characteristics collected by the same station (on that date) using the activity ID.

(I might drop this column since we’ll only be looking at the characteristic “orthophosphate”.)
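
One way to spot-check this conclusion is the small sketch below: if the conclusion holds, no activity ID should span more than one station or more than one sampling date.

# for each activity ID, count how many distinct stations and dates use it;
# if my conclusion is right, both maxima should be 1
water_df %>%
  group_by(`Activity ID`) %>%
  summarize(n_stations = n_distinct(`Station ID`),
            n_dates = n_distinct(the_date)) %>%
  summarize(max_stations = max(n_stations), max_dates = max(n_dates))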

Activity Start

Data Type: Date

This is the date when the sample was taken. As mentioned before, samples are taken on the first day of each month. There are 5 years of data, and so there are 60 days on which samples were taken.

I have converted this attribute to a date and will use the new attribute “the_date” in place of this.

Activity Medium

Data Type: Nominal Categorical

The medium or substance we are testing for characteristics within. In this case we are only testing aspects of water, so this is not very interesting. You can see below that the only medium is water. (This column will be dropped.)

water_df %>%
  group_by(`Activity Medium`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Activity Medium` samples
##   <chr>               <int>
## 1 Water              163574

Activity Type

Data Type: Nominal Categorical

The type of activity being performed. In this case, we have data of routine sampling of water. As you can see below, the only type is “Sample-Routine”. (This column will be dropped.)

water_df %>%
  group_by(`Activity Type`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Activity Type` samples
##   <chr>             <int>
## 1 Sample-Routine   163574

Activity Category-Rep Num

Data Type: Nominal Categorical

This column appears to be a duplicate of “Activity Type”. From the name, it appears that there should be numerical values.

However, as you can see below, “Sample-Routine” is the sole value. Therefore, I have marked this as text data.

water_df %>%
  group_by(`Activity Category-Rep Num`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Activity Category-Rep Num` samples
##   <chr>                         <int>
## 1 Sample-Routine               163574

The metadata does not help other than saying this contains text. Without context, I cannot deduce what this attribute is for. (This column will be dropped.)

Characteristic Name

Data Type: Nominal Categorical

This is the characteristic or aspect of the water being measured. For example, there are samples with measurements for dissolved oxygen in water.

You can see below the number of samples for each characteristic name.

water_df %>%
  group_by(`Characteristic Name`) %>%
  summarize(samples = n())
## # A tibble: 13 x 2
##    `Characteristic Name`                          samples
##    <chr>                                            <int>
##  1 Ammonia-nitrogen                                 14665
##  2 Biochemical oxygen demand, standard conditions   14629
##  3 Chemical oxygen demand                             277
##  4 Dissolved oxygen (DO)                            14775
##  5 Flow                                             16166
##  6 Inorganic nitrogen (nitrate and nitrite)         14704
##  7 Kjeldahl nitrogen                                14704
##  8 Nitrogen                                         14704
##  9 Organic carbon                                     264
## 10 Organic Nitrogen                                 14665
## 11 Orthophosphate                                   14617
## 12 Phosphorus                                       14629
## 13 Total suspended solids                           14775

Sample Fraction

Data Type: Nominal Categorical

Describes what kind of “fraction” the measurement is of. For example, some measurements such as dissolved oxygen only measure the fraction of dissolved oxygen in the water.

You can see below the number of samples for each kind of sample fraction.

water_df %>%
  group_by(`Sample Fraction`) %>%
  summarize(samples = n())
## # A tibble: 3 x 2
##   `Sample Fraction` samples
##   <chr>               <int>
## 1 Dissolved           14775
## 2 Suspended           14775
## 3 Total              134024

Value Type

Data Type: Nominal Categorical

As you can see below, the only value this attribute takes is “Actual”. My best guess is that there could have been values that are estimates.

It seems that I should take the name of this attribute at face value. This attribute simply describes the type of value that is in “Result Value”. (Type as in observed vs estimated, etc.)

(I will be dropping this column.)

water_df %>%
  group_by(`Value Type`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Value Type` samples
##   <chr>          <int>
## 1 Actual        163574

Result Value Status

Data Type: Nominal Categorical

As you can see below, the only value this attribute takes is “Final”. Again, it seems that this attribute should be taken at face value.

This attribute describes the status of the “Result Value” and indicates whether the “Result Value” is fully processed or needs processing.

(I will be dropping this column.)

water_df %>%
  group_by(`Result Value Status`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Result Value Status` samples
##   <chr>                   <int>
## 1 Final                  163574

Result Value

Data Type: Continuous Numerical

This attribute is the measured/observed value.

For example, if a sample is measuring the dissolved oxygen characteristic, the corresponding value is a quantitative measure of the amount of dissolved oxygen.

Units

Data Type: Nominal Categorical

The units the “Result Value” is measured in.

Below you can see the two units used are “mg/l” and “mgd”.

water_df %>%
  group_by(`Units`) %>%
  summarize(samples = n())
## # A tibble: 2 x 2
##   Units samples
##   <chr>   <int>
## 1 mg/l   147408
## 2 mgd     16166

Analytical Proc ID

Data Type: Nominal Categorical

My best guess is that this attribute is the “Analytical Process ID”. Without context, I cannot deduce what this attribute is for. (This column will be dropped.)

As you can see below, the only value is “MDPOINTSOURCE~MDE-DMR”.

water_df %>%
  group_by(`Analytical Proc ID`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Analytical Proc ID`  samples
##   <chr>                   <int>
## 1 MDPOINTSOURCE~MDE-DMR  163574

Here is what I have gathered so far. Looking at the “~” delimiter, on the left is the name of this dataset (“MD POINT SOURCE”) and on the right is “MDE-DMR”.
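
Just to illustrate the structure (a throwaway sketch, since this column will be dropped anyway), we can split the single distinct value on the “~” delimiter:

# split the lone value into its two halves around the "~"
water_df %>%
  distinct(`Analytical Proc ID`) %>%
  separate(`Analytical Proc ID`, into = c("dataset", "source"), sep = "~")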

MDE might stand for the Maryland Department of Environment:

https://mde.maryland.gov/Pages/index.aspx

DMR might stand for discharge monitoring report and I actually found another dataset (which I won’t be using because it’s out of scope of this “tutorial”):

https://echo.epa.gov/tools/data-downloads/icis-npdes-dmr-and-limit-data-set

Location 1

Data Type: Geolocation

After reviewing a few latitude + longitude coordinates on maps, I am fairly certain that this is the location of the station itself (rather than where the sample of the body of water was taken).

Below are a few examples of the format. Parsing will be needed.

water_df %>%
  group_by(`Location 1`) %>%
  summarize(samples = n())
## # A tibble: 266 x 2
##    `Location 1`                        samples
##    <chr>                                 <int>
##  1 "MARYLAND\n(37.966389, -76.018333)"     660
##  2 "MARYLAND\n(37.970278, -75.858333)"     660
##  3 "MARYLAND\n(37.989167, -76.041389)"     660
##  4 "MARYLAND\n(38.0625, -76.3325)"         630
##  5 "MARYLAND\n(38.083333, -75.583333)"     610
##  6 "MARYLAND\n(38.09, -75.42)"             660
##  7 "MARYLAND\n(38.094167, -75.799444)"     660
##  8 "MARYLAND\n(38.1008, -75.73)"            54
##  9 "MARYLAND\n(38.149722, -76.439167)"     660
## 10 "MARYLAND\n(38.158333, -75.838333)"     660
## # … with 256 more rows

Counties

Datatype: Discrete Numerical

This column was not included/listed in the metadata. The name is vague. Without context, I cannot deduce what this attribute is for.

Regardless, I will at least attempt a guess. Counties seems to have something to do with “County”. The table below counts the number of samples for each County and Counties pair.

water_df %>%
  group_by(County, Counties) %>%
  summarize(samples = n())
## # A tibble: 53 x 3
## # Groups:   County [24]
##    County       Counties samples
##    <chr>           <dbl>   <int>
##  1 ALLEGANY          417    7516
##  2 ALLEGANY         2806      90
##  3 ALLEGANY         2862     756
##  4 ANNE ARUNDEL      418   13228
##  5 ANNE ARUNDEL     1815     660
##  6 BALTIMORE         421     660
##  7 BALTIMORE        1763    5030
##  8 BALTIMORE        1857    1440
##  9 CALVERT          1764    1270
## 10 CALVERT          2656     660
## # … with 43 more rows

The table above has not given me much more insight. I do see that there are NA values for Counties. (From data not shown.) To the best of my knowledge, it is safe to drop this column.

NOAA Precipitation

Unlike the Water Point Source Sample data, the NOAA Precipitation data does have very good documentation as linked in Data Acquisition under The Data (“Hourly Precipitation Data Documentation” 2016).

Below is a preview of the data.

rain_df %>% head()
## # A tibble: 6 x 9
##   STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE               
##   <chr>   <chr>            <dbl>    <dbl>     <dbl> <dttm>             
## 1 COOP:1… BALTIMORE W…      47.5     39.2     -76.7 2008-01-01 01:00:00
## 2 COOP:1… BALTIMORE W…      47.5     39.2     -76.7 2008-01-03 11:00:00
## 3 COOP:1… BALTIMORE W…      47.5     39.2     -76.7 2008-01-03 12:00:00
## 4 COOP:1… BALTIMORE W…      47.5     39.2     -76.7 2008-01-05 19:00:00
## 5 COOP:1… BALTIMORE W…      47.5     39.2     -76.7 2008-01-05 20:00:00
## 6 COOP:1… BALTIMORE W…      47.5     39.2     -76.7 2008-01-05 23:00:00
## # … with 3 more variables: HPCP <dbl>, `Measurement Flag` <chr>, `Quality
## #   Flag` <chr>

As you can see below, there are 13,533 samples.

rain_df %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 13533

One last important thing to note: According to the documentation, only hours when rain occurred are entered into this dataset (“Hourly Precipitation Data Documentation” 2016).

STATION

Data Type: Nominal Categorical

Cooperative identification number. This attribute uniquely identifies a station.

The table below shows the number of samples that each station has contributed to the dataset. There are 7 stations in total.

rain_df %>%
  group_by(STATION) %>%
  summarize(samples = n())
## # A tibble: 7 x 2
##   STATION     samples
##   <chr>         <int>
## 1 COOP:180015     800
## 2 COOP:180465    6540
## 3 COOP:180700    1532
## 4 COOP:183090     816
## 5 COOP:185718    1115
## 6 COOP:185934    1652
## 7 COOP:188065    1078

STATION_NAME

Data Type: Nominal Categorical

According to the documentation, this is: “A name identifying the station location” (“Hourly Precipitation Data Documentation” 2016).

The table below shows the number of samples associated with each station name in the dataset. There are 7 station names in total, and this matches up with the 7 station IDs.

rain_df %>%
  group_by(STATION_NAME) %>%
  summarize(samples = n())
## # A tibble: 7 x 2
##   STATION_NAME                                     samples
##   <chr>                                              <int>
## 1 ABERDEEN PHILLIPS FIELD MD US                        800
## 2 BALTIMORE WASHINGTON INTERNATIONAL AIRPORT MD US    6540
## 3 BELTSVILLE MD US                                    1532
## 4 FEDERALSBURG MD US                                   816
## 5 MARYLAND SCIENCE CENTER MD US                       1115
## 6 MILLERS 4 NE MD US                                  1652
## 7 SAVAGE RIVER DAM MD US                              1078

ELEVATION

Data Type: Continuous Numerical

As the name of this attribute suggests, this is the elevation above sea level.

LATITUDE

Data Type: Geolocation

The latitude geographic coordinate of the station.

LONGITUDE

Data Type: Geolocation

The longitude geographic coordinate of the station.

DATE

Data Type: Date

The date that the measurement ended.

We might run into a “logical” problem here since the last possible measurement in a day is “00:00” (“Hourly Precipitation Data Documentation” 2016). The issue may arise if we try to perform operations grouped by the entire day.

HPCP

Data Type: Continuous Numerical

The amount of precipitation that the station measured. This value is cumulative. Since we asked for metric, this is in millimeters (the standard units are hundredths of an inch). Lastly, missing values are coded as 99999 (“Hourly Precipitation Data Documentation” 2016).

Measurement Flag

Data Type: Nominal Categorical

Indicates potential problems or characteristics of the measurement. For example, “M” is for missing data and “g” is used for the very first entry. This is empty if “no flag is needed” (“Hourly Precipitation Data Documentation” 2016).

Quality Flag

Data Type: Nominal Categorical

Indicates characteristics of the quality of the measurement. For example, “R” indicates that one of the “NCDC’s quality control tests” failed (“Hourly Precipitation Data Documentation” 2016).

Data Pre-processing

I will be detailing my data management here, including how I deal with missing data.

My goals and processing steps for each dataset are laid out in the subsections below.

Water Point Source Sample

The first thing I will do is filter so we’ll only have the orthophosphate characteristic. I’ll go ahead and also drop the columns I said I would definitely drop.

ortho_df <- water_df %>%
  filter(`Characteristic Name` == "Orthophosphate") %>%
  # note I am dropping this name because we already know it
  select(-`Characteristic Name`,
         -`Org Name`,
         -`Activity Medium`,
         -`Activity Type`,
         -`Activity Category-Rep Num`,
         -`Value Type`,
         -`Result Value Status`,
         -`Analytical Proc ID`,
         -`Counties`,
         # dropping this because we already have it parsed as "the_date"
         -`Activity Start`)

Another column I suspect we can drop is sample fraction.

ortho_df %>%
  group_by(`Sample Fraction`) %>%
  summarize(samples = n())
## # A tibble: 1 x 2
##   `Sample Fraction` samples
##   <chr>               <int>
## 1 Total               14617

As you can see above, this is no longer telling us any additional information. I drop the column below.

ortho_df <- ortho_df %>% select(-`Sample Fraction`)

The last thing I will do before handling missing data is parse the longitude and latitude. Note that I am still keeping “Location 1”, since comparing doubles for an exact match is not as easy as comparing strings.

ortho_df <- ortho_df %>%
  mutate(loc1 = str_sub(`Location 1`, start = 9)) %>%
  separate(loc1, c("lat","lng"), sep = ",") %>%
  mutate(lat=as.numeric(str_replace(lat, "\\(", ""))) %>%
  mutate(lng=as.numeric(str_replace(lng, "\\)", "")))

ortho_df %>%
  select(`Station ID`, lat, lng) %>%
  head()
## # A tibble: 6 x 3
##   `Station ID`   lat   lng
##   <chr>        <dbl> <dbl>
## 1 MD0021598     39.6 -78.8
## 2 MD0021610     39.4 -77.4
## 3 MD0021610     39.4 -77.4
## 4 MD0021610     39.4 -77.4
## 5 MD0021601     39.2 -76.6
## 6 MD0021610     39.4 -77.4

At this point we can start handling missing data.

I saved the process of handling missing data until now because filtering and dropping unnecessary columns first can simplify this process. (You could be handling missing data that wouldn’t even be there afterwards!)

Of course I had to be careful with filtering. Imputing the mean after filtering can have very different results compared to the reverse. I only filtered for the “orthophosphate” characteristic, which does not affect values like the mean for that particular characteristic (see the toy example below).
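
Here is a tiny self-contained toy example (made-up numbers, not from our data) of why the order matters: the mean you would impute changes depending on whether you filter first.

# toy data: two characteristics with very different scales
toy <- tibble(characteristic = c("Orthophosphate", "Orthophosphate", "Phosphorus"),
              value = c(1, 3, 100))
# the mean computed before filtering pools everything together
mean(toy$value)
# the mean computed after filtering to one characteristic is very different
toy %>%
  filter(characteristic == "Orthophosphate") %>%
  summarize(m = mean(value))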

I analyze the attributes in two batches so the results fit on the page.

ortho_df %>%
  summarize(sta_id_na = sum(is.na(`Station ID`)),
            county_na = sum(is.na(County)),
            huc_na =sum(is.na( HUC)),
            datum = sum(is.na(`Station Horizontal Datum`)),
            act_id_na = sum(is.na(`Activity ID`)))
## # A tibble: 1 x 5
##   sta_id_na county_na huc_na datum act_id_na
##       <int>     <int>  <int> <int>     <int>
## 1         0         0      6     0         0

For this set of attributes, I see there are 6 samples with the HUC (hydrologic unit code) NA. I am not too worried as I do not plan to use this in my final predictor.

ortho_df %>%
  summarize(date_na = sum(is.na(the_date)),
            res_val_na = sum(is.na(`Result Value`)),
            unit_na = sum(is.na(`Units`)),
            lat_na = sum(is.na(lat)),
            lng_na = sum(is.na(lng)))
## # A tibble: 1 x 5
##   date_na res_val_na unit_na lat_na lng_na
##     <int>      <int>   <int>  <int>  <int>
## 1       0          0       0      0      0

If you recall from the data description, I found there were NA-like values stored as text (“UNKWN” and “UNSPECIFIED”). Below I investigate.

ortho_df %>% filter(`Station Horizontal Datum` == "UNKWN")
## # A tibble: 3 x 11
##   `Station ID` County   HUC `Station Horizo… `Activity ID` `Result Value`
##   <chr>        <chr>  <dbl> <chr>            <chr>                  <dbl>
## 1 MD0022993    UNSPE…    NA UNKWN            FY09-PS1399             2.52
## 2 MD0022993    UNSPE…    NA UNKWN            FY09-PS1400             2.52
## 3 MD0022993    UNSPE…    NA UNKWN            FY09-PS1401             2.52
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## #   the_date <dttm>, lat <dbl>, lng <dbl>
ortho_df %>% filter(County == "UNSPECIFIED")
## # A tibble: 6 x 11
##   `Station ID` County   HUC `Station Horizo… `Activity ID` `Result Value`
##   <chr>        <chr>  <dbl> <chr>            <chr>                  <dbl>
## 1 MD0022993    UNSPE…    NA UNKWN            FY09-PS1399             2.52
## 2 MD0022993    UNSPE…    NA UNKWN            FY09-PS1400             2.52
## 3 MD0022993    UNSPE…    NA UNKWN            FY09-PS1401             2.52
## 4 MD0066184    UNSPE…    NA NAD83            FY09-PS3085             0.7 
## 5 MD0066184    UNSPE…    NA NAD83            FY09-PS3086             0.8 
## 6 MD0066184    UNSPE…    NA NAD83            FY09-PS3087             1.1 
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## #   the_date <dttm>, lat <dbl>, lng <dbl>

It looks like this all goes back to the HUC. Recall there were 6 samples with HUC as NA. Let’s see if we can drop these two stations altogether.

ortho_df %>% filter(`Station ID` == "MD0066184")
## # A tibble: 3 x 11
##   `Station ID` County   HUC `Station Horizo… `Activity ID` `Result Value`
##   <chr>        <chr>  <dbl> <chr>            <chr>                  <dbl>
## 1 MD0066184    UNSPE…    NA NAD83            FY09-PS3085              0.7
## 2 MD0066184    UNSPE…    NA NAD83            FY09-PS3086              0.8
## 3 MD0066184    UNSPE…    NA NAD83            FY09-PS3087              1.1
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## #   the_date <dttm>, lat <dbl>, lng <dbl>
ortho_df %>% filter(`Station ID` == "MD0022993")
## # A tibble: 3 x 11
##   `Station ID` County   HUC `Station Horizo… `Activity ID` `Result Value`
##   <chr>        <chr>  <dbl> <chr>            <chr>                  <dbl>
## 1 MD0022993    UNSPE…    NA UNKWN            FY09-PS1399             2.52
## 2 MD0022993    UNSPE…    NA UNKWN            FY09-PS1400             2.52
## 3 MD0022993    UNSPE…    NA UNKWN            FY09-PS1401             2.52
## # … with 5 more variables: Units <chr>, `Location 1` <chr>,
## #   the_date <dttm>, lat <dbl>, lng <dbl>

It seems that these stations each only have 3 data points. This will not be enough for our predictor anyway. (I plan to use data across the 5 years.) I remove these stations below.

ortho_df <- ortho_df %>%
  filter(!(`Station ID` %in% c("MD0066184", "MD0022993")))

So far I like this data! From the results above, it seems there is almost no missing data (we will see that this is not true for the precipitation dataset).

NOAA Precipitation

The very first things to look at for this dataset are the Measurement Flag and the Quality Flag. According to NOAA’s documentation, some data is imprecise, missing, or merely an “indicator”. Some rows simply tell us when the rain began! (No value associated.)

Below we can see that some rows have the quality flag “R”.

rain_df %>%
  group_by(`Quality Flag`) %>%
  summarize(samples = n())
## # A tibble: 2 x 2
##   `Quality Flag` samples
##   <chr>            <int>
## 1 <NA>             13499
## 2 R                   34

As mentioned in the data description, “R” indicates the measurement failed an NCDC quality control test.

My plan is to compute, for each month, the average of the hourly precipitation over only the hours when it rained.

The noise from this data should be negligible because there are only 34 such samples and we are computing the average over a month. Thus, I will disregard this flag.

Let’s begin constructing our transformed data. I will begin by dropping the quality flag column.

rain_tidy <- rain_df %>% select(-`Quality Flag`)

Now onto the measurement flag. We can see below that over a quarter of the data has some special flag.

rain_tidy %>%
  group_by(`Measurement Flag`) %>%
  summarize(samples = n())
## # A tibble: 9 x 2
##   `Measurement Flag` samples
##   <chr>                <int>
## 1 <NA>                  9502
## 2 [                      182
## 3 ]                      184
## 4 {                       96
## 5 }                       96
## 6 a                        7
## 7 A                        7
## 8 g                      315
## 9 T                     3144

Let’s tackle the easiest first. As just mentioned, I only want to average the hours when it does rain. According to the documentation, “T” is for trace amounts. I will consider this as not raining. Below I remove these rows.

rain_tidy <- rain_tidy %>%
  # notice that we want to keep NAs
  # the added condition overrides the default filter behavior
  filter(`Measurement Flag` != "T" | is.na(`Measurement Flag`))

The flag “g” is also fairly simple. The documentation says it is used on the first day when accumulation starts, and the value is 0 (“Hourly Precipitation Data Documentation” 2016). Below I remove these rows.

rain_tidy <- rain_tidy %>%
  # notice that we want to keep NAs
  # the added condition overrides the default filter behavior
  filter(`Measurement Flag` != "g" | is.na(`Measurement Flag`))

According to the documentation, “a” indicates when accumulation begins and “A” indicates when accumulation ends. There is no data associated when the flag is “a” and 99999 for “A” indicates “the accumulation continues into the next month” (“Hourly Precipitation Data Documentation” 2016).

Since the “a”/“A” flags only have 7 entries each, I can just manually take a look. What I am concerned about is accumulation that crosses month boundaries.

rain_tidy %>%
  # the filter condition is true if measurement flag is "a" or "A"
  filter(`Measurement Flag` %in% c("a", "A")) %>%
  select(STATION, DATE, HPCP, `Measurement Flag`)
## # A tibble: 14 x 4
##    STATION     DATE                    HPCP `Measurement Flag`
##    <chr>       <dttm>                 <dbl> <chr>             
##  1 COOP:183090 2010-08-05 01:00:00 25400.   a                 
##  2 COOP:183090 2010-08-06 00:00:00     2.54 A                 
##  3 COOP:183090 2010-08-07 01:00:00 25400.   a                 
##  4 COOP:183090 2010-08-08 00:00:00    15.2  A                 
##  5 COOP:183090 2010-08-08 01:00:00 25400.   a                 
##  6 COOP:183090 2010-08-09 00:00:00     2.54 A                 
##  7 COOP:183090 2010-08-13 01:00:00 25400.   a                 
##  8 COOP:183090 2010-08-14 00:00:00    12.7  A                 
##  9 COOP:183090 2010-08-18 01:00:00 25400.   a                 
## 10 COOP:183090 2010-08-19 00:00:00    22.9  A                 
## 11 COOP:183090 2010-08-19 01:00:00 25400.   a                 
## 12 COOP:183090 2010-08-20 00:00:00    15.2  A                 
## 13 COOP:183090 2010-08-23 01:00:00 25400.   a                 
## 14 COOP:183090 2010-08-24 00:00:00     7.62 A

Oh no! If you look at the table above, the value is 25399.75 instead of 99999! Was the documentation wrong? Nope, it was just that when we asked for metric data, the values were converted. The original unit was hundredths of an inch, and 99999 hundredths of an inch is 999.99 inches, which is about 25399.75 millimeters (our current unit).

The good news is none of these carry into another month! Hurray! Let’s just drop the non-data “a” rows.

rain_tidy <- rain_tidy %>%
  # notice that we want to keep NAs
  # the added condition overrides the default filter behavior
  filter(`Measurement Flag` != "a" | is.na(`Measurement Flag`))

The next part is a little bit tricky. There is missing data encoded with “{}” and “[]”. This can be painful to process because these flags mark intervals: two rows indicate the start and the stop.

What is really interesting is that not all the brackets match! From the table above, there are 182 “[” but 184 “]”. My guess is that since we ordered a subset of the data, the extra “]” rows are closing missing-data intervals that started before our date range (and so are not in our subset).
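
Here is a minimal sketch of how one could poke at this guess: count the opening and closing flags per station and look for stations where the counts disagree. (I do not act on the result; it is just a sanity check.)

# count "[" and "]" flags per station
rain_tidy %>%
  filter(`Measurement Flag` %in% c("[", "]")) %>%
  count(STATION, `Measurement Flag`)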

Recall that I mentioned earlier that I want to compute some mean within a month.

Missing data is not a problem if there are data points within that month: I can just impute the mean.

But if my end goal is to get the mean itself, then I can drop these rows! (Since the whole point of imputing the mean is to have the mean stay the same.)

The trickier problem comes if the data has a complete month missing. I want to do this “mean computation” for each month. If there is no data, then the mean itself is missing too!

In this case, I will look across years and impute the value using that station’s overall average for that calendar month across all years (the “global” month average computed below). (I am assuming that a given month, say January, will have similar weather from year to year.)

Now we’ve reached a point where we need to deal with dates. Recall from the data description that the “DATE” attribute is the end of the measurement period. However, we run into a problem for edge cases where the measurement ends at 12AM.

Ideally, we would want this rain to count toward the previous day. (The rain happened, and then the results were recorded once the clock turned 12AM.) Below we can see that this case indeed exists in our data.

rain_tidy %>%
  filter(hour(DATE) == 0) %>%
  select(STATION, DATE, HPCP, `Measurement Flag`)
## # A tibble: 601 x 4
##    STATION     DATE                 HPCP `Measurement Flag`
##    <chr>       <dttm>              <dbl> <chr>             
##  1 COOP:180465 2008-01-11 00:00:00  0.51 <NA>              
##  2 COOP:180465 2008-01-18 00:00:00  0.25 <NA>              
##  3 COOP:180465 2008-02-13 00:00:00  0.25 <NA>              
##  4 COOP:180465 2008-03-05 00:00:00  0.51 <NA>              
##  5 COOP:180465 2008-03-20 00:00:00  1.78 <NA>              
##  6 COOP:180465 2008-04-04 00:00:00  0.51 <NA>              
##  7 COOP:180465 2008-04-14 00:00:00  0.51 <NA>              
##  8 COOP:180465 2008-04-27 00:00:00  1.52 <NA>              
##  9 COOP:180465 2008-04-28 00:00:00  0.25 <NA>              
## 10 COOP:180465 2008-05-09 00:00:00  7.87 <NA>              
## # … with 591 more rows

There are 601 of these rows! Note that the printed POSIX format does not show the hour when the time is exactly at the start of the day (12AM).

The simplest solution is just to add a column for the starting date which I perform below.

rain_tidy <- rain_tidy %>% mutate(start_date = (DATE - hours(1)))

rain_tidy %>%
  select(STATION, DATE, start_date, HPCP) %>%
  head()
## # A tibble: 6 x 4
##   STATION     DATE                start_date           HPCP
##   <chr>       <dttm>              <dttm>              <dbl>
## 1 COOP:180465 2008-01-05 20:00:00 2008-01-05 19:00:00  0.25
## 2 COOP:180465 2008-01-06 02:00:00 2008-01-06 01:00:00  0.51
## 3 COOP:180465 2008-01-06 03:00:00 2008-01-06 02:00:00  0.76
## 4 COOP:180465 2008-01-06 04:00:00 2008-01-06 03:00:00  1.02
## 5 COOP:180465 2008-01-06 05:00:00 2008-01-06 04:00:00  1.78
## 6 COOP:180465 2008-01-06 06:00:00 2008-01-06 05:00:00  0.25

Now let’s get back to the missing data.

To see if there is actually some non-missing data in a month, we can add a logical flag (0 or 1) for whether each row is not missing. Then if we group and sum, the sum can only be 0 when all the data in the group is missing.

Note that when I say “month”, I am abbreviating: I really mean a particular month of a particular year for a particular station.

rain_tidy <- rain_tidy %>%
  # we consider the data not to be missing if the flag is NA
  # or if the flag is not one of the missing flags
  mutate(not_missing = (is.na(`Measurement Flag`) |
                          !(`Measurement Flag` %in% c("{", "}", "[", "]"))
                        )) %>%
  group_by(STATION, year(start_date), month(start_date)) %>%
  mutate(num_not_missing = sum(as.numeric(not_missing))) %>%
  ungroup() %>%
  # We only need to keep if num_not_missing is 0 since
  # those are the samples we cannot compute the mean for.
  # If num_not_missing != 0, we will implicitly handle this when
  # we estimate the mean.
  # (Recall that our "estimate" mean then would be missing.)
  filter(not_missing | num_not_missing == 0)

The above code looks complex but all it does is handle dropping all the missing data we can safely drop.

Let’s see how much data was in the second case: where an entire month of data was missing for a station!

rain_tidy %>%
  group_by(`Measurement Flag`) %>%
  summarize(samples = n())
## # A tibble: 6 x 2
##   `Measurement Flag` samples
##   <chr>                <int>
## 1 <NA>                  9502
## 2 [                      136
## 3 ]                      138
## 4 {                       48
## 5 }                       48
## 6 A                        7

Wow, that’s quite a lot of rows belonging to months that are missing entirely!

Now I will compute, for each station, the mean for each calendar month across all years. There are 7 stations and 12 months in a year, so I expect 7 * 12 = 84 rows.

mean_df <- rain_tidy %>%
  filter(not_missing) %>%
  group_by(STATION, month(start_date)) %>%
  summarize(avg = mean(HPCP)) %>%
  rename(month = `month(start_date)`)

mean_df
## # A tibble: 84 x 3
## # Groups:   STATION [7]
##    STATION     month   avg
##    <chr>       <dbl> <dbl>
##  1 COOP:180015     1  2.83
##  2 COOP:180015     2  2.54
##  3 COOP:180015     3  2.78
##  4 COOP:180015     4  3.81
##  5 COOP:180015     5  3.51
##  6 COOP:180015     6  4.99
##  7 COOP:180015     7  5.20
##  8 COOP:180015     8  9.03
##  9 COOP:180015     9  4.39
## 10 COOP:180015    10  3.84
## # … with 74 more rows

Indeed we do get 84 rows. This is good: no station was hopelessly bad. We can deduce that every station collected data for every calendar month at least once over the span of the data.

We have everything we need to start making the final summarized data. I am expecting 7 stations, 12 months in a year, and a span of 6 years (2008 through 2013).

7 * 12 * 6 = 504, so I expect 504 rows once we are done.

Let’s first construct a table with all the combinations. This will be the “skeleton” and will tell us which values are implicitly missing.

station_ids <- rain_tidy %>% distinct(STATION)
month_span <- seq(from = 1, to = 12)
year_span <- seq(from = 2008, to = 2013)

rain_final <- crossing(station_ids, month_span, year_span) %>%
  rename(month = month_span, year = year_span)

rain_final
## # A tibble: 504 x 3
##    STATION     month  year
##    <chr>       <int> <int>
##  1 COOP:180465     1  2008
##  2 COOP:180465     1  2009
##  3 COOP:180465     1  2010
##  4 COOP:180465     1  2011
##  5 COOP:180465     1  2012
##  6 COOP:180465     1  2013
##  7 COOP:180465     2  2008
##  8 COOP:180465     2  2009
##  9 COOP:180465     2  2010
## 10 COOP:180465     2  2011
## # … with 494 more rows

Now let’s compute what would have been our final table.

final_with_missing <- rain_tidy %>%
  filter(not_missing) %>%
  group_by(STATION, year(start_date), month(start_date)) %>%
  summarize(avg = mean(HPCP)) %>%
  rename(year = `year(start_date)`, month = `month(start_date)`)

final_with_missing
## # A tibble: 305 x 4
## # Groups:   STATION, year [35]
##    STATION      year month   avg
##    <chr>       <dbl> <dbl> <dbl>
##  1 COOP:180015  2008     5  3.32
##  2 COOP:180015  2008     6  3.81
##  3 COOP:180015  2008     7  5.41
##  4 COOP:180015  2008     8  3.67
##  5 COOP:180015  2008     9  5.17
##  6 COOP:180015  2008    10  2.72
##  7 COOP:180015  2008    11  3.29
##  8 COOP:180015  2008    12  2.94
##  9 COOP:180015  2009     1  2.83
## 10 COOP:180015  2009     2  2.54
## # … with 295 more rows

The code above computes the monthly mean (average) precipitation for the hours that rained for each station.

As you can see, there are only 305 rows! That is far fewer than 504: we are missing a lot of data. However, I am in too deep and we press forward.

There would be no way to tell which combinations were missing had we not generated the “skeleton”. Below I perform a left join to make the implicitly missing data explicit.

rain_final <- rain_final %>%
  left_join(final_with_missing,
            by=c("STATION" = "STATION",
                 "year" = "year",
                 "month" = "month"))

rain_final
## # A tibble: 504 x 4
##    STATION     month  year    avg
##    <chr>       <dbl> <dbl>  <dbl>
##  1 COOP:180465     1  2008  0.81 
##  2 COOP:180465     1  2009  1.12 
##  3 COOP:180465     1  2010  1.02 
##  4 COOP:180465     1  2011  1.47 
##  5 COOP:180465     1  2012 NA    
##  6 COOP:180465     1  2013  1.51 
##  7 COOP:180465     2  2008  1.44 
##  8 COOP:180465     2  2009  0.505
##  9 COOP:180465     2  2010  1.44 
## 10 COOP:180465     2  2011  1.34 
## # … with 494 more rows

As you can see, station COOP:180465 had missing data on January 2012. Now all we have to do is impute the missing data.

First I will perform another left join so we can refer to the “global” month average. Remember we computed the mean_df before? Now we use it.

rain_final <- rain_final %>%
  left_join(mean_df %>% rename(global_avg = avg),
            by = c("STATION" = "STATION", "month" = "month"))

rain_final %>% head()
## # A tibble: 6 x 5
##   STATION     month  year   avg global_avg
##   <chr>       <dbl> <dbl> <dbl>      <dbl>
## 1 COOP:180465     1  2008  0.81       1.19
## 2 COOP:180465     1  2009  1.12       1.19
## 3 COOP:180465     1  2010  1.02       1.19
## 4 COOP:180465     1  2011  1.47       1.19
## 5 COOP:180465     1  2012 NA          1.19
## 6 COOP:180465     1  2013  1.51       1.19

We’re in the endgame now! The final step is easy: we just need to replace avg with global_avg whenever avg is NA.

rain_final <- rain_final %>% mutate(avg = if_else(is.na(avg), global_avg, avg))

rain_final %>% head()
## # A tibble: 6 x 5
##   STATION     month  year   avg global_avg
##   <chr>       <dbl> <dbl> <dbl>      <dbl>
## 1 COOP:180465     1  2008  0.81       1.19
## 2 COOP:180465     1  2009  1.12       1.19
## 3 COOP:180465     1  2010  1.02       1.19
## 4 COOP:180465     1  2011  1.47       1.19
## 5 COOP:180465     1  2012  1.19       1.19
## 6 COOP:180465     1  2013  1.51       1.19

As you can see above, we’ve essentially “borrowed” the value from global_avg wherever avg was NA. As a sanity check, let’s make sure we still have 504 rows in our final table!

rain_final %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   504

Exploratory Data Analysis

Water Point Source Sample

What I became immediately curious about was how these stations were distributed across Maryland. I provide a map below.

map <- ortho_df %>%
  distinct(lat, lng, `Station ID`) %>%
    leaflet() %>%
    addTiles() %>%
    addMarkers(lng = ~lng,
               lat = ~lat,
               popup = ~`Station ID`,
               clusterOptions = markerClusterOptions())

map

This distribution looks good! It seems we are covering many of the waters in Maryland.

One peculiar thing I noticed was that a few stations were at the exact same location. It was very subtle, but I’m glad I caught it! I investigate this below.

ortho_df %>%
  distinct(`Location 1`, `Station ID`) %>%
  group_by(`Location 1`) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 261 x 2
##    `Location 1`                        count
##    <chr>                               <int>
##  1 "MARYLAND\n(38.583333, -77.157778)"     4
##  2 "MARYLAND\n(39.359167, -76.370556)"     3
##  3 "MARYLAND\n(37.966389, -76.018333)"     1
##  4 "MARYLAND\n(37.970278, -75.858333)"     1
##  5 "MARYLAND\n(37.989167, -76.041389)"     1
##  6 "MARYLAND\n(38.0625, -76.3325)"         1
##  7 "MARYLAND\n(38.083333, -75.583333)"     1
##  8 "MARYLAND\n(38.09, -75.42)"             1
##  9 "MARYLAND\n(38.094167, -75.799444)"     1
## 10 "MARYLAND\n(38.149722, -76.439167)"     1
## # … with 251 more rows

As you can see above, there is one geolocation shared by 4 “different stations” (and another shared by 3)!

Below, we can see the Station IDs for these peculiar stations.

ortho_df %>%
  filter(`Location 1` == "MARYLAND\n(38.583333, -77.157778)" |
           `Location 1` == "MARYLAND\n(39.359167, -76.370556)") %>%
  distinct(`Station ID`, `Location 1`)
## # A tibble: 7 x 2
##   `Station ID`          `Location 1`                       
##   <chr>                 <chr>                              
## 1 MD0001201             "MARYLAND\n(39.359167, -76.370556)"
## 2 MD0003158-OUTFALL 025 "MARYLAND\n(38.583333, -77.157778)"
## 3 MD0003158-OUTFALL 010 "MARYLAND\n(38.583333, -77.157778)"
## 4 MD0021555-OUTFALL 001 "MARYLAND\n(39.359167, -76.370556)"
## 5 MD0003158-OUTFALL 085 "MARYLAND\n(38.583333, -77.157778)"
## 6 MD0003158-OUTFALL 021 "MARYLAND\n(38.583333, -77.157778)"
## 7 MD0021555-OUTFALL 002 "MARYLAND\n(39.359167, -76.370556)"

I looked up “outfall.” Basically, it is the point where water leaves a facility. My best guess is that these IDs correspond to the multiple points where water can exit a single facility. I will worry about this when I have to (later).

Clustering

I will do some unsupervised learning with k-means clustering. I would like to know what regions are similar in their orthophosphate levels.

Note that when clustering it is very important to scale the data so that all variables are on similar scales. Let’s do that first.

Below, I first get a yearly average for all the unique geolocations.

lat_lng_year_avg <- ortho_df %>%
  mutate(the_year = year(the_date)) %>%
  group_by(the_year, lat, lng) %>%
  summarize(region_year_avg = mean(`Result Value`))

lat_lng_year_avg %>% head()
## # A tibble: 6 x 4
## # Groups:   the_year, lat [6]
##   the_year   lat   lng region_year_avg
##      <dbl> <dbl> <dbl>           <dbl>
## 1     2008  38.0 -76.0          2.52  
## 2     2008  38.0 -75.9          0.34  
## 3     2008  38.0 -76.0          2.52  
## 4     2008  38.1 -76.3          2.52  
## 5     2008  38.1 -75.6          0.102 
## 6     2008  38.1 -75.4          0.0117

Now we can standardize within each year across the geolocations. Below, I perform this operation.

cluster_this <- lat_lng_year_avg %>%
  group_by(the_year) %>%
  mutate(year_mean = mean(region_year_avg),
         year_stdev = sd(region_year_avg)) %>%
  ungroup() %>%
  mutate(std_val = (region_year_avg - year_mean) / year_stdev)

cluster_this %>% head()
## # A tibble: 6 x 7
##   the_year   lat   lng region_year_avg year_mean year_stdev std_val
##      <dbl> <dbl> <dbl>           <dbl>     <dbl>      <dbl>   <dbl>
## 1     2008  38.0 -76.0          2.52        1.83       1.74   0.396
## 2     2008  38.0 -75.9          0.34        1.83       1.74  -0.857
## 3     2008  38.0 -76.0          2.52        1.83       1.74   0.396
## 4     2008  38.1 -76.3          2.52        1.83       1.74   0.396
## 5     2008  38.1 -75.6          0.102       1.83       1.74  -0.994
## 6     2008  38.1 -75.4          0.0117      1.83       1.74  -1.05

Next, we scale the geolocations. In this case I compute an unweighted mean, since there is no real significance to one location having more data points than another.

In other words, we are simply trying to do a scaling operation to make the clustering work better.

Below I compute the unweighted mean and standard deviation for each coordinate using distinct.

lat_stats <- cluster_this %>% distinct(lat) %>% summarize(avg = mean(lat), stdev = sd(lat))
lng_stats <- cluster_this %>% distinct(lng) %>% summarize(avg = mean(lng), stdev = sd(lng))

cluster_this <- cluster_this %>%
  mutate(std_lat = (lat - lat_stats$avg) / lat_stats$stdev,
         std_lng = (lng - lng_stats$avg) / lng_stats$stdev)

Let’s visualize the data. In the plot below, I plot based on geolocation and then add color to indicate the values.

cluster_this %>%
  ggplot(mapping = aes(x = std_lat, y = std_lng, color = std_val)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~the_year, ncol = 3) + 
  scale_color_gradient(low = "black", high = "red") +
  labs(y = "Standardized Longitude", x = "Standardized Latitude",
       color = "Std. Value",
       title = "Standardized Values against Geolocation")

It’s faint, but there are some red circles. And whoa! It seems we have data 10 standard deviations away from the mean!

cluster_this %>%
  # here I plot the normal points
  filter(std_val < 2) %>%
  ggplot(mapping = aes(x = std_lat, y = std_lng)) +
  geom_point(alpha = 0.3) +
  # here I plot the exceptions
  geom_point(data = cluster_this %>% filter(std_val >= 2),
             mapping = aes(size = std_val),
             color = "red") +
  facet_wrap(~the_year, ncol = 3) +
  labs(y = "Standardized Longitude", x = "Standardized Latitude",
     size = "Std. Value",
     title = "Standardized Values against Geolocation with Exceptions Highlighted")

What could possibly explain these exceptions? I wonder if there is an error in my computation or in the data.

Below, I compute the summary statistics.

summary(lat_lng_year_avg$region_year_avg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2017  1.3833  1.5925  2.5200 19.3227

That very large maximum explains it! It does not seem like an error. From the shape of Maryland and the shape of the graph, it seems like the measurement is near Washington, D.C., perhaps on something like the Potomac River. Maybe that’s why the levels are high?
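To see roughly where that maximum sits, here is a quick sketch that drops the grouping and sorts the yearly averages:

# the geolocations with the largest yearly averages
lat_lng_year_avg %>%
  ungroup() %>%
  arrange(desc(region_year_avg)) %>%
  head(3)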

Below I run the clustering algorithm for k parameters 1 through 10.

set.seed(1234)
cluster_res <- cluster_this %>%
  # we only want to cluster on these three attributes
  select(std_lat, std_lng, std_val) %>%
  cluster::clusGap(stats::kmeans, K.max=10)

cluster_res
## Clustering Gap statistic ["clusGap"] from call:
## cluster::clusGap(x = ., FUNcluster = stats::kmeans, K.max = 10)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 1
##           logW   E.logW       gap      SE.sim
##  [1,] 6.703889 7.522039 0.8181507 0.008482272
##  [2,] 6.462340 7.133386 0.6710458 0.007592232
##  [3,] 6.297964 6.997406 0.6994417 0.016910665
##  [4,] 6.230786 6.893768 0.6629815 0.008491294
##  [5,] 6.107800 6.796320 0.6885209 0.007084830
##  [6,] 6.022372 6.706946 0.6845739 0.006845093
##  [7,] 5.973502 6.648053 0.6745519 0.006307008
##  [8,] 5.834772 6.599722 0.7649499 0.010465435
##  [9,] 5.788874 6.563810 0.7749353 0.007963834
## [10,] 5.736389 6.534534 0.7981450 0.006800404

We can see an output table. Let’s visualize this. First we will need to tidy the results.

# get the table and convert it to a tidy tibble
result <- cluster_res$Tab %>%
  as_tibble() %>%
  # make the row's id into an attribute itself so we can graph it
  rowid_to_column("k")

Below I graph the gap statistic for each k value.

ggplot(result, mapping = aes(x = k, y = gap)) +
  geom_point() +
  geom_line() +
  geom_errorbar(mapping = aes(ymin = gap - SE.sim, ymax = gap + SE.sim)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
  labs(y = "Gap Statistic", x = "k Parameter",
     title = "Gap Statistic vs k Parameter")

From this plot it would seem that the optimal number of clusters is 1. For more information on the gap statistic and clustering, see the references (Tibshirani, Walther, and Hastie 2002). Let’s double-check this.

cluster::maxSE(cluster_res$Tab[ ,"gap"], cluster_res$Tab[ ,"SE.sim"],
      method = "Tibs2001SEmax",
      SE.factor = 1)
## [1] 1
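To spell out the “Tibs2001SEmax” rule: it picks the smallest k such that Gap(k) ≥ Gap(k+1) − SE(k+1). Plugging in the table above, Gap(1) ≈ 0.818 while Gap(2) − SE(2) ≈ 0.671 − 0.008 = 0.663, so k = 1 already satisfies the rule and the search stops there.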

Wow, this clustering really failed. This result goes against my intuition, since I expected nearby areas to have similar water contents. Intuitively, such areas should be likely to cluster together since all three attributes (latitude, longitude, orthophosphate level) would be close.

I would like to return to the distribution of the values. There seemed to be some extreme exceptions.

lat_lng_year_avg %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = the_year, y = region_year_avg, group = the_year)) +
  labs(y = "Average Orthophosphate (mg/l)", x = "Year",
     title = "Distribution of Average Orthophosphate by Year")

In the plot above we can see the distribution of orthophosphate levels across locations for each year. There are some clear outliers.

NOAA Precipitation

Below I show a map for the stations that collect precipitation data.

rain_map <- rain_tidy %>%
  distinct(LATITUDE, LONGITUDE) %>%
  leaflet() %>%
  addTiles() %>%
  addMarkers(lng = ~LONGITUDE, lat = ~LATITUDE, clusterOptions = markerClusterOptions())

rain_map

Interestingly, we get 10 unique geolocations even though there are only 7 stations. Below I output the unique geolocations.

rain_tidy %>%
  distinct(LATITUDE, LONGITUDE)
## # A tibble: 10 x 2
##    LATITUDE LONGITUDE
##       <dbl>     <dbl>
##  1     39.2     -76.7
##  2     39.0     -76.9
##  3     39.0     -76.9
##  4     39.3     -76.6
##  5     39.5     -76.2
##  6     39.5     -76.2
##  7     38.7     -75.8
##  8     39.7     -76.8
##  9     39.5     -79.1
## 10     39.5     -79.1

As you can see above, some stations appear to report slightly different coordinates (likely rounding differences). This will not be a problem since our final table uses only the station IDs.
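If you are curious which stations report more than one coordinate pair, here is a quick sketch (assuming rain_tidy still carries the STATION column):

# stations whose reported coordinates vary slightly between records
rain_tidy %>%
  distinct(STATION, LATITUDE, LONGITUDE) %>%
  count(STATION) %>%
  filter(n > 1)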

The following graph shows the average hourly rain accumulation through time.

rain_final %>%
  mutate(the_date =
           # I kinda had to hack this date conversion
           # To get a date object, I compute a period and then add to year 0.
           as.Date(years(year) + months(month - 1),
                   origin = "0000-01-01")) %>%
  ggplot() +
  geom_line(mapping = aes(x = the_date, y = avg, color = STATION)) +
  labs(y = "Average Hourly Precipitation (Tenth Millimeter)", x = "Date",
     title = "Average Hourly Precipitation vs Time")

Note that “Average Hourly Precipitation” is the average over only those hours when it actually rained.
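As an aside, lubridate’s make_date() would avoid the period-plus-origin hack used above. A minimal sketch, assuming the first of the month is an acceptable day:

# build a proper Date from the year/month columns, pinning the day to 1
rain_final %>%
  mutate(the_date = make_date(year = year, month = month, day = 1)) %>%
  select(STATION, the_date, avg) %>%
  head()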

It seems that different areas in Maryland get different amounts of rainfall, but they all follow a general trend. This graph is a little hard to read, so let’s facet it.

rain_final %>%
  mutate(the_date =
           # I kinda had to hack this date conversion
           # To get a date object, I compute a period and then add to year 0.
           as.Date(years(year) + months(month - 1),
                   origin = "0000-01-01")) %>%
  ggplot() +
  geom_line(mapping = aes(x = the_date, y = avg)) +
  facet_wrap(~STATION, ncol = 4) +
  theme(axis.text.x = element_text(angle = 25, hjust = 0.75)) +
  labs(y = "Average Hourly Precipitation (Tenth Millimeter)", x = "Date",
     title = "Average Hourly Precipitation vs Time")

We can see roughly 6 spikes in rain for each station. This makes sense since there are 6 years of data and there are “rainy” months/seasons.
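To back up the “rainy season” point, a small sketch that averages across all stations and years by calendar month:

# which calendar months are wettest, averaged over all stations and years
rain_final %>%
  group_by(month) %>%
  summarize(overall_avg = mean(avg)) %>%
  arrange(desc(overall_avg))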

Methods

I will be showing you how to prepare the data.

Our features will be the monthly orthophosphate averages for each station from earlier years, together with the average hourly precipitation for each NOAA station and year.

Our target is the orthophosphate level for the latest month.

Joining the Datasets

Note that I want to establish a “causality” between rain and orthophosphate (in some sense). So I actually want the previous month of rain to be paired with each orthophosphate level. The orthophosphate is sampled at the beginning of each month, so we can almost think of this as the “result” of the previous month.

ortho_df <- ortho_df %>%
  mutate(binding_date = (the_date - months(1))) %>%
  # these make it easier to do a join as we'll see
  mutate(binding_month = month(binding_date)) %>%
  mutate(binding_year = year(binding_date))

Let’s get to splitting the data into our predictor data and outcome (target) data.

outcome_cutoff <- ortho_df %>% summarize(max_date = max(the_date))
outcome_cutoff <- outcome_cutoff$max_date[1] - years(1)
outcome_cutoff
## [1] "2012-06-01 UTC"

Now we need to do some processing on rain_final. I want each row to correspond to a particular month, so we will first combine the station ID and year, and then use the spread function.

rain_join_this <- rain_final %>%
  # unite STATION and year with a "." in between, column name is spread_key
  unite("spread_key", STATION, year, sep = ".") %>%
  select(-global_avg) %>%
  spread(spread_key, avg)

rain_join_this
## # A tibble: 12 x 43
##    month `COOP:180015.20… `COOP:180015.20… `COOP:180015.20…
##    <dbl>            <dbl>            <dbl>            <dbl>
##  1     1             2.83             2.83             2.83
##  2     2             2.54             2.54             2.54
##  3     3             2.78             2.79             2.78
##  4     4             3.81             2.54             3.95
##  5     5             3.32             3.75             3.51
##  6     6             3.81             3.39             4.99
##  7     7             5.41             4.87             5.20
##  8     8             3.67            10.7              9.03
##  9     9             5.17             3.64             4.39
## 10    10             2.72             4.21             3.84
## 11    11             3.29             2.96             3.16
## 12    12             2.94             5.01             3.67
## # … with 39 more variables: `COOP:180015.2011` <dbl>,
## #   `COOP:180015.2012` <dbl>, `COOP:180015.2013` <dbl>,
## #   `COOP:180465.2008` <dbl>, `COOP:180465.2009` <dbl>,
## #   `COOP:180465.2010` <dbl>, `COOP:180465.2011` <dbl>,
## #   `COOP:180465.2012` <dbl>, `COOP:180465.2013` <dbl>,
## #   `COOP:180700.2008` <dbl>, `COOP:180700.2009` <dbl>,
## #   `COOP:180700.2010` <dbl>, `COOP:180700.2011` <dbl>,
## #   `COOP:180700.2012` <dbl>, `COOP:180700.2013` <dbl>,
## #   `COOP:183090.2008` <dbl>, `COOP:183090.2009` <dbl>,
## #   `COOP:183090.2010` <dbl>, `COOP:183090.2011` <dbl>,
## #   `COOP:183090.2012` <dbl>, `COOP:183090.2013` <dbl>,
## #   `COOP:185718.2008` <dbl>, `COOP:185718.2009` <dbl>,
## #   `COOP:185718.2010` <dbl>, `COOP:185718.2011` <dbl>,
## #   `COOP:185718.2012` <dbl>, `COOP:185718.2013` <dbl>,
## #   `COOP:185934.2008` <dbl>, `COOP:185934.2009` <dbl>,
## #   `COOP:185934.2010` <dbl>, `COOP:185934.2011` <dbl>,
## #   `COOP:185934.2012` <dbl>, `COOP:185934.2013` <dbl>,
## #   `COOP:188065.2008` <dbl>, `COOP:188065.2009` <dbl>,
## #   `COOP:188065.2010` <dbl>, `COOP:188065.2011` <dbl>,
## #   `COOP:188065.2012` <dbl>, `COOP:188065.2013` <dbl>

We will also need to get the orthophosphate data into the right format. If all goes well, the code below should work.

predictor_df <- ortho_df %>%
  filter(the_date < outcome_cutoff) %>%
  select(`Station ID`, `Result Value`, binding_month, binding_year) %>%
  unite("station_month", `Station ID`, binding_month, sep = ".")# %>%
  # spread(binding_year, `Result Value`)

Oh no! As you can see, I commented out code to show what we would have done. There was a unique-key error. In other words, the (station_month, binding_year) tuple is not unique! Some stations recorded more than once.

predictor_df %>%
  group_by(station_month, binding_year) %>%
  summarize(samples = n()) %>%
  arrange(desc(samples))
## # A tibble: 11,453 x 3
## # Groups:   station_month [3,024]
##    station_month binding_year samples
##    <chr>                <dbl>   <int>
##  1 MD0000311.12          2009       2
##  2 MD0000311.1           2009       1
##  3 MD0000311.1           2010       1
##  4 MD0000311.1           2011       1
##  5 MD0000311.1           2012       1
##  6 MD0000311.10          2008       1
##  7 MD0000311.10          2009       1
##  8 MD0000311.10          2010       1
##  9 MD0000311.10          2011       1
## 10 MD0000311.11          2008       1
## # … with 11,443 more rows

In the output above we can see that station “MD0000311” (December 2009) is the offender.

Below I investigate further.

predictor_df %>% filter(station_month == "MD0000311.12", binding_year == "2009")
## # A tibble: 2 x 3
##   station_month `Result Value` binding_year
##   <chr>                  <dbl>        <dbl>
## 1 MD0000311.12            0.08         2009
## 2 MD0000311.12            0.05         2009

Let’s look back to the original data and see if maybe these samples were taken on different days.

ortho_df %>% filter(`Station ID` == "MD0000311",
                    binding_month == 12,
                    binding_year == "2009") %>%
  select(the_date)
## # A tibble: 2 x 1
##   the_date           
##   <dttm>             
## 1 2010-01-01 00:00:00
## 2 2010-01-01 00:00:00

So these are conflicting measurements taken on the same day! The simplest solution is just to average them.

predictor_df <- predictor_df %>%
  group_by(station_month, binding_year) %>%
  summarize(avg_value = mean(`Result Value`)) %>%
  spread(binding_year, avg_value)

predictor_df
## # A tibble: 3,024 x 6
## # Groups:   station_month [3,024]
##    station_month `2008` `2009` `2010` `2011` `2012`
##    <chr>          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 MD0000311.1    NA     0.04   0.08   0.073  0.05 
##  2 MD0000311.10    0.07  0.06   0.073  0.142 NA    
##  3 MD0000311.11    0.03  0      0.09   0.142 NA    
##  4 MD0000311.12    0.1   0.065  0.073  0.05  NA    
##  5 MD0000311.2    NA     0.02   0      0.073  0.263
##  6 MD0000311.3    NA     0.06   0      0.073  0.064
##  7 MD0000311.4    NA     0.02   0      0.073  0.092
##  8 MD0000311.5    NA     0.02   0      0.073 NA    
##  9 MD0000311.6     0.08  0.04   0.05   0.213 NA    
## 10 MD0000311.7     0.12  0.16   0.07   0.071 NA    
## # … with 3,014 more rows

We can see that a lot of data is missing.

Below, I compute how many values are missing in each row.

attributes_missing <- rowSums(is.na(predictor_df))

attributes_missing %>% head()
## [1] 1 1 1 1 1 1

We can see below that there is no perfect row (without any data missing). This is unfortunate.

min(attributes_missing)
## [1] 1

There should be about 15 thousand values in total (3,024 rows × 5 year columns). We can see below that over 20% of the data is missing (3,667 values, roughly 24%).

sum(attributes_missing)
## [1] 3667

A future step could be to impute these missing values, perhaps using the column mean.
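A minimal sketch of what that could look like, ungrouping first since predictor_df is still grouped by station_month (note I do not assign the result, so the rest of the walkthrough is unchanged):

# replace each NA with the mean of its year column
predictor_df %>%
  ungroup() %>%
  mutate_at(vars(-station_month), ~ if_else(is.na(.), mean(., na.rm = TRUE), .))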

Now let’s join the two datasets! (The outcome data still needs to be built, though.)

joined_data <- predictor_df %>%
  separate("station_month", c("station_id", "month"), sep = "\\.") %>%
  mutate(month = as.numeric(month)) %>%
  left_join(rain_join_this, by = "month")

joined_data
## # A tibble: 3,024 x 49
##    station_id month `2008` `2009` `2010` `2011` `2012` `COOP:180015.20…
##    <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>            <dbl>
##  1 MD0000311      1  NA     0.04   0.08   0.073  0.05              2.83
##  2 MD0000311     10   0.07  0.06   0.073  0.142 NA                 2.72
##  3 MD0000311     11   0.03  0      0.09   0.142 NA                 3.29
##  4 MD0000311     12   0.1   0.065  0.073  0.05  NA                 2.94
##  5 MD0000311      2  NA     0.02   0      0.073  0.263             2.54
##  6 MD0000311      3  NA     0.06   0      0.073  0.064             2.78
##  7 MD0000311      4  NA     0.02   0      0.073  0.092             3.81
##  8 MD0000311      5  NA     0.02   0      0.073 NA                 3.32
##  9 MD0000311      6   0.08  0.04   0.05   0.213 NA                 3.81
## 10 MD0000311      7   0.12  0.16   0.07   0.071 NA                 5.41
## # … with 3,014 more rows, and 41 more variables: `COOP:180015.2009` <dbl>,
## #   `COOP:180015.2010` <dbl>, `COOP:180015.2011` <dbl>,
## #   `COOP:180015.2012` <dbl>, `COOP:180015.2013` <dbl>,
## #   `COOP:180465.2008` <dbl>, `COOP:180465.2009` <dbl>,
## #   `COOP:180465.2010` <dbl>, `COOP:180465.2011` <dbl>,
## #   `COOP:180465.2012` <dbl>, `COOP:180465.2013` <dbl>,
## #   `COOP:180700.2008` <dbl>, `COOP:180700.2009` <dbl>,
## #   `COOP:180700.2010` <dbl>, `COOP:180700.2011` <dbl>,
## #   `COOP:180700.2012` <dbl>, `COOP:180700.2013` <dbl>,
## #   `COOP:183090.2008` <dbl>, `COOP:183090.2009` <dbl>,
## #   `COOP:183090.2010` <dbl>, `COOP:183090.2011` <dbl>,
## #   `COOP:183090.2012` <dbl>, `COOP:183090.2013` <dbl>,
## #   `COOP:185718.2008` <dbl>, `COOP:185718.2009` <dbl>,
## #   `COOP:185718.2010` <dbl>, `COOP:185718.2011` <dbl>,
## #   `COOP:185718.2012` <dbl>, `COOP:185718.2013` <dbl>,
## #   `COOP:185934.2008` <dbl>, `COOP:185934.2009` <dbl>,
## #   `COOP:185934.2010` <dbl>, `COOP:185934.2011` <dbl>,
## #   `COOP:185934.2012` <dbl>, `COOP:185934.2013` <dbl>,
## #   `COOP:188065.2008` <dbl>, `COOP:188065.2009` <dbl>,
## #   `COOP:188065.2010` <dbl>, `COOP:188065.2011` <dbl>,
## #   `COOP:188065.2012` <dbl>, `COOP:188065.2013` <dbl>
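
For completeness, here is a rough, untested sketch of how the outcome (target) table could be built from the outcome_cutoff we computed earlier, averaging duplicate samples the same way as before. The names outcome_df and target_value are just placeholders:

# orthophosphate samples from the final year of data become the target values
outcome_df <- ortho_df %>%
  filter(the_date >= outcome_cutoff) %>%
  group_by(`Station ID`, month = month(the_date)) %>%
  summarize(target_value = mean(`Result Value`))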

Conclusions

In conclusion, acquiring and wrangling the data is a large part of the effort. I underestimated the time it would take.

One of the biggest bottlenecks was missing data, so unfortunately I did not get to answer the question of whether we can predict orthophosphate levels.

I hope that this “tutorial” helps someone and perhaps someone will do supervised machine learning with the processed data.

References

Chislock, Michael F., E. Doster, R. A. Zitomer, and A. E. Wilson. 2013. “Eutrophication: Causes, Consequences, and Controls in Aquatic Ecosystems.” Nature Education Knowledge. https://www.nature.com/scitable/knowledge/library/eutrophication-causes-consequences-and-controls-in-aquatic-102364466.

“Hourly Precipitation Data Documentation.” 2016. National Oceanic and Atmospheric Administration. https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/PRECIP_HLY_documentation.pdf.

“Hydrologic Unit Maps.” 2018. USGS Water Resources of the United States. U.S. Department of Interior | U.S. Geological Survey. https://water.usgs.gov/GIS/huc.html.

Shumway, Sandra E. 1990. “A Review of the Effects of Algal Blooms on Shellfish and Aquaculture.” Journal of the World Aquaculture Society 21 (2): 65–104. doi:10.1111/j.1749-7345.1990.tb00529.x.

Soltis-Muth, Cheryl. n.d. “TP - It Stands for Total Phosphate.” http://www.ohiowea.org/docs/OWEA_Phosphorus_CSM.pdf.

“The Elements of Geodesy: The Horizontal Datum.” 2004. NOAA National Ocean Service Education: Geodesy. US Department of Commerce | National Oceanic and Atmospheric Administration. https://oceanservice.noaa.gov/education/kits/geodesy/geo05_horizdatum.html.

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2002. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411–23. doi:10.1111/1467-9868.00293.

Zhang, Feng, Jiyoung Lee, Song Liang, and Ck Shum. 2015. “Cyanobacteria Blooms and Non-Alcoholic Liver Disease: Evidence from a County Level Ecological Study in the United States.” Environmental Health 14 (1). doi:10.1186/s12940-015-0026-7.