TDM 10100: Project 10 — 2022
Motivation: As we have learned, functions are foundational to more complex programs and behaviors.
There is an entire programming paradigm based on functions called functional programming.
Context: We will apply functions to entire vectors of data using sapply. We learned how to create functions, and the next step is to use them on a series of data. sapply is one of the best ways to do this in R.
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/okcupid/filtered/users.csv
- /anvil/projects/tdm/data/okcupid/filtered/questions.csv
Helpful Hint
The read.csv() function splits fields on a comma (,) by default.
You can use a different delimiter by adding the sep argument, e.g. read.csv(…, sep=';').
Use the readLines(…, n=x) function to see the first x lines of a file and identify the character to use in the sep argument.
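As a minimal sketch of the hint above, the example below writes a tiny semicolon-delimited file to a temporary location, peeks at it with readLines(), and then reads it with read.csv(). The file contents and the names tmp and toy are made up for illustration; they are not part of the project data.

```r
# Write a small semicolon-delimited file to a temporary location (made-up data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;answer", "1;yes", "2;no"), tmp)

# Peek at the first two raw lines to spot the delimiter.
print(readLines(tmp, n = 2))

# Read the file, telling read.csv() that fields are separated by semicolons.
toy <- read.csv(tmp, sep = ";")
str(toy)
```

If sep were left at its default here, every row would be read as a single column, which is usually the sign that the delimiter needs to be changed.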
Questions
ONE
We want to go ahead and load the datasets into data.frames named users and questions. Take a look at both data.frames and identify what is a part of each of them. What information is in each dataset, and how are they related?
- Code used to solve this problem.
- Output from running the code.
- 1 or 2 sentences on the datasets.
TWO
Simply put, grep helps us find a word within a string. In R, grep is vectorized and can be applied to an entire vector of strings. We will use it to find any questions that mention google in the data.frame questions.
- What do you notice if you just use the function grep(), create a new variable google, and then print that variable?
- Now that you know the row number, how can you take a look at the information there?
(Bonus question: can you find a shortcut that combines the two steps above?)
Helpful Hint
grep() is a function in R that is used to search for matches of a pattern within each element of a character vector. grepl() performs the same search but returns a logical vector.
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
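To make the difference between these calls concrete, here is a small sketch on a made-up character vector (the vector phrases and the pattern are illustrative assumptions, not the project data).

```r
# A made-up character vector; the real questions data.frame is not used here.
phrases <- c("Do you like cats?", "Is Pizza a vegetable?", "pizza for breakfast?")

# grep() returns the positions of the elements that match the pattern.
hits <- grep("pizza", phrases)

# ignore.case = TRUE also matches "Pizza".
hits_any_case <- grep("pizza", phrases, ignore.case = TRUE)

# value = TRUE returns the matching strings instead of their positions.
matched_text <- grep("pizza", phrases, ignore.case = TRUE, value = TRUE)

# grepl() returns one TRUE/FALSE per element.
flags <- grepl("pizza", phrases)

print(hits)
print(hits_any_case)
print(matched_text)
print(flags)
```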
Insider Information
Just an FYI refresher:
- <- is an assignment operator; it assigns values to a variable.
- Functions must be called using round brackets, aka parentheses ().
- Square brackets [] are also called extraction operators, as they are used to extract specific elements from a vector or matrix.
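The three points above can be seen together in a short sketch; the names scores and m are made up for illustration.

```r
# <- assigns the vector on the right to the name on the left.
scores <- c(10, 20, 30, 40)

# Round brackets call a function, here length() and mean().
n <- length(scores)
avg <- mean(scores)

# Square brackets extract an element from a vector by position...
second <- scores[2]

# ...or from a matrix by row and column (matrix() fills column by column).
m <- matrix(1:6, nrow = 2)
cell <- m[2, 3]

print(c(n = n, avg = avg, second = second, cell = cell))
```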
- Code used to solve this problem.
- Output from running the code.
THREE
- Using the row from our previous question, which variable does this correspond with in the data.frame users?
- Knowing that the two possible answers are "No. Why spoil the mystery?" and "Yes, Knowledge is power!", what percentage of users do NOT google someone before the first date?
Helpful Hint
- Row 2172 in questions corresponds to the column named q170849 in users.
- The table() function can be used to quickly create frequency tables.
- The prop.table() function can calculate the value of each cell in a table as a proportion of all values.
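A minimal sketch of how table() and prop.table() fit together, using a made-up vector of answers rather than the users data:

```r
# Made-up answers, including one missing value.
answers <- c("yes", "no", "yes", "yes", NA, "no")

# table() counts each distinct value; NA is dropped by default.
counts <- table(answers)

# prop.table() divides each count by the total, giving proportions that sum to 1.
props <- prop.table(counts)

# Multiply by 100 to express the proportions as percentages.
percent <- round(100 * props, 1)

print(counts)
print(percent)
```

Because missing values are excluded by table() by default, the proportions here are out of the answered rows only; whether that is the right denominator depends on the question being asked.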
- Code used to solve this problem.
- Output from running the code.
FOUR
Using the ability to create a function AND tapply, find the percentages of Female vs Male (Man vs Woman, as categorized in the users data.frame) who DO google someone before their date.
Helpful Hint
- The tapply() function can be used to apply some function to a vector that has been grouped by another vector: tapply(X, INDEX, FUN)
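As a rough illustration of tapply() with a user-written function, the sketch below uses made-up scores and group labels. The helper pct_passing and the threshold of 70 are hypothetical and chosen only for the example.

```r
# Made-up data: a score and a group label for six people.
score <- c(55, 80, 92, 40, 75, 68)
group <- c("a", "b", "a", "b", "a", "b")

# Hypothetical helper: percentage of values at or above 70.
pct_passing <- function(v) {
  100 * mean(v >= 70)
}

# tapply() splits score by group and applies pct_passing to each piece.
by_group <- tapply(score, group, pct_passing)
print(by_group)
```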
- Code used to solve this problem.
- Output from running the code.
FIVE
Using the ability to create a function AND using sapply(), write a function that takes a string and removes everything after and including the _ from the gender_orientation column in the users data.frame. For example, Hetero_male → Hetero; we want to do this for the entire gender_orientation column. It is also OK to solve this question as given in the video, without a function and without sapply().
Insider Information
sapply() allows you to iterate over a list or vector without the need to use a for loop, which is typically a slow way to work in R.
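A small sketch of the pattern, assuming a made-up helper count_words and a made-up vector of sentences (neither comes from the project):

```r
# Hypothetical helper: return the number of space-separated words in one string.
count_words <- function(s) {
  length(strsplit(s, " ", fixed = TRUE)[[1]])
}

sentences <- c("R is fun", "apply a function", "to every element of a vector")

# sapply() calls count_words once per element and simplifies the result to a vector.
# For character input, the result is named by the input strings.
word_counts <- sapply(sentences, count_words)

# USE.NAMES = FALSE returns a plain, unnamed vector instead.
plain_counts <- sapply(sentences, count_words, USE.NAMES = FALSE)

print(word_counts)
print(plain_counts)
```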
Remember the difference (a very brief summary of each):
- A vector is the basic data structure in R. Vectors come in two kinds, atomic vectors and lists, and have three common properties:
  - Type: typeof()
  - Length: length()
  - Attributes: attributes()
  Atomic vectors and lists differ in the types of elements they hold. All elements in an atomic vector must be the same type (atomic vectors are also always "flat"), but the elements of a list can be of different types. Lists are constructed with the function list(). Atomic vectors are constructed with the function c(). You can determine the specific type by using functions like is.character(), is.double(), is.integer(), is.logical().
- A matrix is two-dimensional, with rows and columns, and all cells must be the same type. It can be created with the function matrix().
- An array can be one-dimensional or multi-dimensional. An array with one dimension is similar (but not identical) to a vector. An array with two dimensions is similar (but not identical) to a matrix. An array with three or more dimensions is an n-dimensional array. It can be created with the function array().
- A data frame is like a table, or like a matrix, BUT the columns can hold different types of data.
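The summary above can be checked interactively. The sketch below builds one small object of each kind; all names and values are made up for illustration.

```r
# Atomic vector: every element has the same type (here, double).
atomic <- c(1.5, 2, 3)

# List: elements may have different types.
mixed <- list(1L, "a", TRUE)

# Matrix: two dimensions, one type for every cell.
m <- matrix(1:6, nrow = 2)

# Array: here three dimensions (2 x 3 x 4).
a <- array(1:24, dim = c(2, 3, 4))

# Data frame: columns can hold different types.
df <- data.frame(id = 1:3, name = c("x", "y", "z"), stringsAsFactors = FALSE)

print(typeof(atomic))
print(typeof(mixed))
print(dim(m))
print(dim(a))
print(sapply(df, class))
```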
- Code used to solve this problem.
- Output from running the code.
Resources
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.