TDM 20100: Project 4 — 2022
Motivation: Becoming comfortable chaining commands and navigating files in a terminal is important for every data scientist. By learning the basics of a few useful tools, you will be able to quickly understand and manipulate files in ways that are simply not possible with tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script in your favorite language, you may find that these UNIX tools are a better fit for your needs.
Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
Scope: grep, regular expression basics, UNIX utilities, redirection, piping
Dataset(s)
The questions below use the following dataset(s):
- /anvil/projects/tdm/data/stackoverflow/unprocessed/*
- /anvil/projects/tdm/data/stackoverflow/processed/*
- /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
For this project, please submit a |
Question 1
In a csv file, there are n+1 columns where n is the number of commas (in theory). Take the first line of unprocessed/2011.csv, replace all commas with the newline character, \n, and use wc to count the resulting number of lines. This should approximate how many columns are in the dataset. What is the value?
This can’t be right, can it? Print the first 100 lines after using tr to replace commas with newlines. What do you notice?
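To illustrate the idea without touching the real data, here is a minimal sketch on a made-up two-line file. The file contents and the variable names (tmp, header_cols, row_cols) are assumptions for illustration only; they are not taken from unprocessed/2011.csv.

```shell
# Hypothetical toy file standing in for a real csv (contents are made up).
tmp=$(mktemp)
printf '%s\n' 'id,title,body' '1,"Hello, world","a, b, c"' > "$tmp"

# Header line: 2 commas become 2 newlines, so wc -l reports 3 lines.
header_cols=$(head -n 1 "$tmp" | tr ',' '\n' | wc -l)

# Second line: tr cannot tell field separators from commas inside quotes,
# so the line count overshoots the true number of fields (3).
row_cols=$(sed -n '2p' "$tmp" | tr ',' '\n' | wc -l)

echo "header: $header_cols, data row: $row_cols"
rm -f "$tmp"
```

The mismatch between the two counts is the kind of thing to look for when the number you get from the real file seems too large.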
The newline character in UNIX is \n.
- Code used to solve this problem.
- Output from running the code.
Question 2
As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in processed/2011.csv. How many columns are there?
Take a look at iowa_liquor_sales_cleaner.txt. How many columns does that file have?
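One possible way to count columns once the delimiter is known is to let awk report the number of fields on the header line. The sketch below uses a made-up file and assumes a semicolon delimiter; check the real files with head first, since their delimiter and column names may differ.

```shell
# Hypothetical toy file; the ';' delimiter and column names are assumptions.
tmp=$(mktemp)
printf '%s\n' 'invoice;date;store;bottles' 'A1;2020-01-02;Store 5;12' > "$tmp"

# -F sets the field separator; NF is the number of fields on the line.
ncols=$(head -n 1 "$tmp" | awk -F';' '{ print NF }')

echo "columns: $ncols"
rm -f "$tmp"
```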
- Code used to solve this problem.
- Output from running the code.
Question 3
Continuing to look at the iowa_liquor_sales_cleaner.txt dataset, what are the 5 largest orders by number of bottles sold? How about by Gallons sold?
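A "largest N by some column" question usually comes down to skipping the header, sorting numerically in reverse on that column, and keeping the first N lines. Here is a small sketch on made-up data; the delimiter, the column position (2), and the values are all assumptions, so adjust them to match what you see in the real file.

```shell
# Hypothetical toy file; delimiter, column order, and values are made up.
tmp=$(mktemp)
printf '%s\n' 'invoice;bottles' 'A;12' 'B;240' 'C;6' 'D;48' 'E;1000' 'F;3' > "$tmp"

# tail -n +2 skips the header line.
# sort -t';' sets the delimiter; -k2,2nr sorts on field 2 only,
# numerically (n) and in reverse (r), so the largest values come first.
top=$(tail -n +2 "$tmp" | sort -t';' -k2,2nr | head -n 3)

echo "$top"
rm -f "$tmp"
```

The same pipeline works for a different measure by changing the field number given to -k.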
- Code used to solve this problem.
- Output from running the code.
Question 4
What are the different sizes (in ml) that a bottle of liquor comes in?
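Listing the distinct values in a column is a common pattern: pull out the column with cut, sort it, and collapse repeats with uniq. The sketch below uses a made-up file; the delimiter and the position of the volume column are assumptions.

```shell
# Hypothetical toy file; delimiter, column order, and values are made up.
tmp=$(mktemp)
printf '%s\n' 'invoice;volume_ml' 'A;750' 'B;375' 'C;750' 'D;1000' 'E;375' > "$tmp"

# cut extracts field 2; sort -n orders the values numerically;
# uniq then drops adjacent duplicates, leaving one line per distinct size.
sizes=$(tail -n +2 "$tmp" | cut -d';' -f2 | sort -n | uniq)

echo "$sizes"
rm -f "$tmp"
```

Note that uniq only removes adjacent duplicates, which is why the sort step comes first.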
- Code used to solve this problem.
- Output from running the code.
Question 5
Benford’s law states that, in real-life sets of numerical data, the leading digit is likely to follow a distinct distribution (see the plot in the provided link). By this logic, the dollar amount in the orders should roughly match this, right?
Use any available bash tools you’d like to get a good idea of the count or percentage of the sales (in dollars) by starting digit. Are the results expected? Could there be some "funny business" going on? Write 1-2 sentences explaining what you think.
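One way to tally leading digits is to isolate the dollar column, strip any currency symbol, keep only the first character, and count how often each one appears. This is only a sketch on made-up data; the delimiter, the column position, and the presence of a leading $ are assumptions about the format, not facts about the real file.

```shell
# Hypothetical toy file; delimiter, column order, and values are made up.
tmp=$(mktemp)
printf '%s\n' 'invoice;sale_dollars' 'A;$12.50' 'B;$103.00' 'C;$27.10' 'D;$1.99' 'E;$250.00' > "$tmp"

# cut -f2 extracts the dollar column, tr -d removes the '$' sign,
# cut -c1 keeps the first character, and sort | uniq -c counts each digit.
counts=$(tail -n +2 "$tmp" | cut -d';' -f2 | tr -d '$' | cut -c1 | sort | uniq -c)

echo "$counts"
rm -f "$tmp"
```

Each output line holds a count followed by the digit, which can be compared by eye against the distribution Benford’s law predicts.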
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.