TDM 10100: Project 5 — Fall 2023
Motivation: R differs from other programming languages in that R works great with vectorized functions and the apply suite of functions (instead of using loops).
The apply family of functions provides an alternative to loops. You can use |
Context: We will focus in this project on efficient ways of processing data in R.
Scope: tapply function
Dataset(s)
The following questions will use the following dataset in Anvil:
/anvil/projects/tdm/data/election/escaped2020sample.txt
A txt file and a csv file both store information in plain text. Data in csv files are almost always separated by commas. In txt files, the fields can be separated by commas, semicolons, pipe symbols, tabs, or other separators. To read in a txt file in which the data is separated by pipe symbols, add sep="|" (see code below).
You might want to use 3 cores in this project when you set up your Jupyter Lab session.
Data Understanding
The file uses '|' (instead of commas) to separate the data fields. The reason is that one column of data contains full names, which sometimes include commas.
myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|")
head(myDF)
When looking at the head of the data frame, notice that the entries in the TRANSACTION_DT column have the month, day, and year all crammed together without any slashes between them.
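As a minimal sketch of why the separator matters, the example below reads a tiny made-up sample with read.csv and sep="|". The column names NAME and STATE and all of the values are assumptions for illustration; only TRANSACTION_DT and TRANSACTION_AMT are names taken from the text above.

```r
# Hypothetical two-row sample in a pipe-separated layout.
# NAME and STATE are assumed column names; every value is made up.
sample_text <- "NAME|STATE|TRANSACTION_DT|TRANSACTION_AMT
SMITH, JOHN|IN|11032020|250
DOE, JANE|OH|12152019|100"

# sep = "|" splits on pipes only, so the comma inside each name
# stays within a single field.
sampleDF <- read.csv(text = sample_text, sep = "|")

# The dates appear as digits run together, with no slashes.
head(sampleDF)
```

With the default comma separator, each name would instead be split at its comma, which is the problem the pipe separator avoids.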
lubridate
The lubridate package can be used to put a column into a date format. In general, data that contains information about dates can sometimes be hard to put into a date format, but the lubridate package makes this easier.
library(lubridate, warn.conflicts = FALSE)
myDF$newdates <- mdy(myDF$TRANSACTION_DT)
A new column newdates is created, with the same data as the TRANSACTION_DT column but now stored in date format.
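To illustrate the conversion on its own, here is a small sketch that applies mdy to a few made-up strings in the same month-day-year layout. The vector names raw_dates and parsed are hypothetical, and the values are not taken from the dataset.

```r
library(lubridate, warn.conflicts = FALSE)

# Made-up date strings: month, day, year, with no separators.
raw_dates <- c("11032020", "12152019", "01312020")

# mdy() reads each string as month-day-year and returns Date values.
parsed <- mdy(raw_dates)
parsed

# Once the values are dates, pieces such as the year or month
# can be pulled out directly.
year(parsed)
month(parsed)
```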
Feel free to check out the official cheatsheet to learn more about the lubridate package.
tapply
tapply() helps us apply functions (for instance: mean, median, minimum, maximum, sum, etc…) to data, one group at a time. The tapply() function is most helpful when we need to break data into groups, applying a function to each of the groups of data.
The tapply function takes three inputs: some data to work on; a way to break the data into groups; and a function to apply to each group of data.
tapply(myDF$TRANSACTION_AMT, myDF$newdates, sum)
- The tapply function applies sum to the myDF$TRANSACTION_AMT data, grouped according to myDF$newdates.
- Three inputs for tapply:
  - myDF$TRANSACTION_AMT: the data vector to work on
  - myDF$newdates: the way to break the data into groups
  - sum: the function to apply to each piece of data
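The three inputs can be seen at work on a toy example. This is only a sketch: the vectors amounts and dates are made-up stand-ins for myDF$TRANSACTION_AMT and myDF$newdates.

```r
# Made-up amounts and dates standing in for the real columns.
amounts <- c(100, 250, 50, 75, 25)
dates   <- as.Date(c("2020-01-05", "2020-01-05", "2020-02-10",
                     "2020-02-10", "2020-03-01"))

# One sum per distinct date; the result is named by the group labels.
daily_totals <- tapply(amounts, dates, sum)
daily_totals

# Any function that summarizes a vector can be used in place of sum.
daily_means <- tapply(amounts, dates, mean)
daily_means
```

The result has one entry per group, so five transactions on three distinct dates give three totals.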
Questions
Question 1 (1.5 pts)
- Use the year function (from the lubridate library) on the column newdates, to create a new column named TRANSACTION_YR.
- Using tapply, add the values in the TRANSACTION_AMT column, according to the values in the TRANSACTION_YR column.
- Plot the years on the x-axis and the total amount of the transactions by year on the y-axis.
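The general pattern (extract the year, total by year, then plot) might look like the sketch below on a toy data frame. The data frame toyDF and its values are made up; this shows the shape of the steps and is not a solution for the real data.

```r
library(lubridate, warn.conflicts = FALSE)

# Toy data frame with made-up values.
toyDF <- data.frame(
  newdates = as.Date(c("2019-03-01", "2019-11-20",
                       "2020-02-14", "2020-10-31")),
  TRANSACTION_AMT = c(100, 50, 200, 300)
)

# year() pulls the calendar year out of each date.
toyDF$TRANSACTION_YR <- year(toyDF$newdates)

# Total amount per year.
yearly_totals <- tapply(toyDF$TRANSACTION_AMT, toyDF$TRANSACTION_YR, sum)

# The names of the tapply result hold the years as text,
# so they are converted to numbers for the x-axis.
plot(as.numeric(names(yearly_totals)), as.numeric(yearly_totals),
     xlab = "Year", ylab = "Total transaction amount")
```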
Question 2 (1.5 pts)
- From Question 1, you may notice that the majority of the data collected is found in the years 2019-2020. Please create a new dataframe that only contains data for the dates in the range 01/01/2020-12/31/2020.
- Using tapply, get the sum of the money in the TRANSACTION_AMT column, grouped according to the months January through December (in 2020 only).
- Plot the months on the x-axis and the total amount of the transactions (for each month) on the y-axis.
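Restricting a data frame to a date range can be done with a logical test on the date column. The sketch below uses a made-up data frame (toyDF) and assumes the dates contain no missing values; rows with missing dates would need separate handling.

```r
library(lubridate, warn.conflicts = FALSE)

# Toy data frame with made-up values spanning three years.
toyDF <- data.frame(
  newdates = as.Date(c("2019-12-31", "2020-01-15", "2020-01-20",
                       "2020-06-05", "2021-01-01")),
  TRANSACTION_AMT = c(10, 20, 30, 40, 50)
)

# TRUE for rows whose date falls inside 2020 (both endpoints included).
in_2020 <- toyDF$newdates >= as.Date("2020-01-01") &
           toyDF$newdates <= as.Date("2020-12-31")
toy2020 <- toyDF[in_2020, ]

# month() returns the month number (1 through 12) for each date.
monthly_totals <- tapply(toy2020$TRANSACTION_AMT,
                         month(toy2020$newdates), sum)
monthly_totals
```

Only months that actually occur in the subset appear in the result, so a month with no transactions has no entry.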
Question 3 (1.5 pts)
Let’s go back to using the full set of data across all of the years (from Question 1). We can continue to experiment with the tapply function.
- Please find the donor who gave the most money (altogether) in the whole data set.
- Find the total amount of money given (altogether) in each state. Then sort the states, according to the total amount of money given altogether. In which 5 states was the most money given?
- What are the ten zipcodes in which the most money is donated (altogether)?
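Each of these asks for the largest group totals, which is a tapply followed by a sort. A minimal sketch on made-up vectors follows; amounts and states are hypothetical names, and the columns in the real data frame may be named differently.

```r
# Made-up amounts and group labels.
amounts <- c(100, 300, 50, 200, 25, 400)
states  <- c("IN", "OH", "IN", "CA", "OH", "CA")

# Total per group.
state_totals <- tapply(amounts, states, sum)

# Sort from largest to smallest, then keep the first two entries.
top_states <- head(sort(state_totals, decreasing = TRUE), n = 2)
top_states

# The label of the single largest group.
names(top_states)[1]
```

Changing n gives the top five, top ten, and so on.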
Question 4 (2 pts)
- Using a barplot or dotchart, plot the total amount of money given in each of the top five states.
- Using a barplot or dotchart, plot the total amount of money given in each of the top ten zipcodes.
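Both plotting functions take a named numeric vector and use the names as labels. The sketch below uses made-up totals (top_totals is a hypothetical name) to show the two calls side by side.

```r
# Made-up totals, already sorted from largest to smallest.
top_totals <- c(CA = 600, OH = 325, IN = 150)

# barplot() draws one bar per value and labels each bar with its name.
# It returns the x positions of the bar midpoints.
mids <- barplot(top_totals, xlab = "State", ylab = "Total amount given")

# dotchart() shows the same values as points on labeled rows.
dotchart(top_totals, xlab = "Total amount given")
```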
Question 5 (1.5 pts)
- Analyze something that you find interesting about the election data, make a plot to demonstrate your insight, and then explain your finding in a few sentences.
Project 05 Assignment Checklist
- Jupyter Lab notebook with your code and comments for the assignment
  - firstname-lastname-project05.ipynb
- R code and comments for the assignment
  - firstname-lastname-project05.R
- Submit files through Gradescope
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.