{ "cells": [ { "cell_type": "markdown", "id": "f2f69444", "metadata": {}, "source": [ "# CSC380 Homework 5 : Data Analysis and Visualization\n", "\n", "**Overview** This homework will familiarize you with the basic steps involved in reading, analyzing, and visualizing data. We will use the [Starbucks Nutrition Dataset](https://www.kaggle.com/starbucks/starbucks-menu), which itemizes most of the food and drink (12oz) options available at the Starbucks coffee chain. To simplify things, we have processed the data for you into a JSON file distributed with the homework (filename: starbucks.json). We will be using the Pandas library to load and manipulate data. I briefly introduced all of the Pandas functionality that you will need in class, and additional links are provided inline below. As always, you can find all relevant material on the [CSC380 Webpage](http://pachecoj.com/courses/csc380_fall21/).\n", "\n", "**What to turn in** Please submit the completed Jupyter notebook to D2L. Make sure it is the .ipynb file, not a .html file! All cells are marked with instructions to insert your code. Please complete all cells as directed.\n", "\n", "**Installing Pandas** \n", "To install any Python library, just type:\n", "\n", "!pip3 install \"library name\"\n", "\n", "Or, if you are using Anaconda, type:\n", "\n", "!conda install \"library name\"\n", "\n", "The cell below can be used to install Pandas, or you can do it on the command line." 
] }, { "cell_type": "code", "execution_count": 3, "id": "ce35c216", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Uncomment and run the line below to install Pandas using pip\n", "#!pip3 install pandas\n", "\n", "# Uncomment and run the line below to install Pandas using Anaconda\n", "#!conda install pandas" ] }, { "cell_type": "code", "execution_count": 1, "id": "6bd8ee0a", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import pandas as pd " ] }, { "cell_type": "markdown", "id": "177f5abe", "metadata": {}, "source": [ "## Problem 1 : Basic Operations and Stats from the dataset (1 point)" ] }, { "cell_type": "markdown", "id": "3cbd9fc4", "metadata": {}, "source": [ "Download the data file \"starbucks.json\" and load it into a Pandas DataFrame.\n", "\n", "What is a Pandas DataFrame? - https://www.geeksforgeeks.org/python-pandas-dataframe/\n", "\n", "Hint : Check out the read_json function - https://www.w3schools.com/python/pandas/pandas_json.asp " ] }, { "cell_type": "code", "execution_count": 1, "id": "786721b9", "metadata": { "scrolled": false }, "outputs": [], "source": [ "starbucks_df = #insert your code here\n", "starbucks_df" ] }, { "cell_type": "markdown", "id": "177f4f23", "metadata": {}, "source": [ "Printing the entire dataframe is cumbersome. How can we look at just the first and last **two** rows of a dataframe?\n", "\n", "Check out .head() and .tail() - https://www.tutorialspoint.com/python_pandas/python_pandas_basic_functionality.htm\n", "\n", "What are the first two and last two rows of the dataframe?" ] }, { "cell_type": "code", "execution_count": null, "id": "c92b342c", "metadata": { "scrolled": true }, "outputs": [], "source": [ "starbucks_df.#insert your code here" ] }, { "cell_type": "markdown", "id": "1328f4d9", "metadata": {}, "source": [ "How can we access just a column of a dataset in pandas? 
https://cmdlinetips.com/2020/04/3-ways-to-select-one-or-more-columns-with-pandas/\n", "\n", "It is okay if the printed output shows only the first and last few elements with dots in between; this is how pandas summarizes long output.\n", "\n", "Print the column 'Beverage_prep'." ] }, { "cell_type": "code", "execution_count": null, "id": "1da3405d", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "c9fe934a", "metadata": {}, "source": [ "One goal of data science is to use data to answer questions. This is done in an automated way, without us having to go through the data manually. Let's try answering some simple questions about Starbucks' menu items." ] }, { "cell_type": "markdown", "id": "40642a57", "metadata": {}, "source": [ "### a. On average, how much caffeine does a Starbucks drink have?" ] }, { "cell_type": "markdown", "id": "bf65f524", "metadata": {}, "source": [ "Hint: Check out the math functions of a pandas dataframe. \n", "\n", "https://erikrood.com/Python_References/pandas_column_average_median_final.html" ] }, { "cell_type": "code", "execution_count": null, "id": "7867820d", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "88a2c1d4", "metadata": {}, "source": [ "### b. What is the *typical* (median) amount of caffeine in a Starbucks drink?" ] }, { "cell_type": "code", "execution_count": null, "id": "5b6b5f91", "metadata": {}, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "6578590b", "metadata": {}, "source": [ "### c. What is the maximum amount of caffeine you can find in a Starbucks drink? " ] }, { "cell_type": "code", "execution_count": null, "id": "b3628978", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "f7df6db8", "metadata": {}, "source": [ "### d. 
What is the least amount of caffeine you can find in a Starbucks drink? " ] }, { "cell_type": "code", "execution_count": null, "id": "4aa496ee", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "5518de70", "metadata": {}, "source": [ "## Problem 2 : Pie Chart (2 points)" ] }, { "cell_type": "markdown", "id": "5f64660a", "metadata": {}, "source": [ "Let's explore the dataset a bit further." ] }, { "cell_type": "markdown", "id": "87be94ff", "metadata": {}, "source": [ "### a. What are the different types of drinks (i.e., Beverage Category) that Starbucks has? How many of each?" ] }, { "cell_type": "markdown", "id": "9ba8f689", "metadata": {}, "source": [ "Hint - Check out the pandas value_counts() function." ] }, { "cell_type": "code", "execution_count": null, "id": "6e3b9585", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#print the different beverage categories and how many of each here\n", "\n", "#insert your code here" ] }, { "cell_type": "markdown", "id": "d73cd9e6", "metadata": {}, "source": [ "Let's make these more appealing. Plot them as a pie chart." ] }, { "cell_type": "code", "execution_count": null, "id": "d8980a2c", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "beverage_category_counts = #insert your code here\n", "labels = #insert your code here\n", "sizes = #insert your code here\n", "\n", "#insert your code here\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "835aface", "metadata": {}, "source": [ "## Problem 3 : Bar Chart (2 points)" ] }, { "cell_type": "markdown", "id": "1e9e98cb", "metadata": {}, "source": [ "Suppose you have a very calorie-conscious friend who really likes to get drinks at Starbucks. As a budding Data Scientist, you want to help them out." ] }, { "cell_type": "markdown", "id": "fdc27585", "metadata": {}, "source": [ "### a. 
What is the drink with the fewest calories at Starbucks?" ] }, { "cell_type": "markdown", "id": "19d6da87", "metadata": {}, "source": [ "Hint : Check this out ==> https://www.interviewqs.com/ddi-code-snippets/rows-cols-python" ] }, { "cell_type": "code", "execution_count": null, "id": "f4c88119", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#you can print the entire row or just the name\n", "\n", "#insert your code here" ] }, { "cell_type": "markdown", "id": "722f9b51", "metadata": {}, "source": [ "But they quickly get bored of this drink. I mean, it's only natural. \n", "\n", "So, let's recommend a beverage category instead." ] }, { "cell_type": "markdown", "id": "533e3e16", "metadata": {}, "source": [ "First, let's find how many calories each beverage category has on average.\n", "\n", "Hint - Check out the groupby function. The first example on this page is what we are trying to do. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html" ] }, { "cell_type": "code", "execution_count": null, "id": "8b7a6bf0", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "5b5cb33c", "metadata": {}, "source": [ "### b. Plot a bar graph" ] }, { "cell_type": "markdown", "id": "869f40d3", "metadata": {}, "source": [ "Let's make this visually appealing by plotting a bar graph, where the height of each bar is the average number of calories in that beverage category. \n", "\n", "Hint : Check this out -> https://benalexkeen.com/bar-charts-in-matplotlib/" ] }, { "cell_type": "code", "execution_count": null, "id": "124ae5c8", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "5cca2e73", "metadata": {}, "source": [ "### By looking at the graph, which beverage category has the least average calories?" 
] }, { "cell_type": "code", "execution_count": null, "id": "c8f7a425", "metadata": {}, "outputs": [], "source": [ "print('The answer')" ] }, { "cell_type": "markdown", "id": "5cd3cd98", "metadata": {}, "source": [ "Let's keep looking.\n", "\n", "### By looking at the graph, which beverage category has the second least average calories?" ] }, { "cell_type": "code", "execution_count": null, "id": "771ea7cd", "metadata": { "scrolled": true }, "outputs": [], "source": [ "print('The Answer')" ] }, { "cell_type": "markdown", "id": "9e9d6fbe", "metadata": {}, "source": [ "This gives us some idea of how many calories to expect in each beverage category. But we know from our previous classes that the mean alone does not tell us how the values are spread. In this case, while the average is useful, we also need to know how the calories are spread across the various drinks within a beverage category.\n", "\n", "### What is the standard deviation of calories within each beverage category?" ] }, { "cell_type": "code", "execution_count": null, "id": "65508776", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "af67aa0f", "metadata": {}, "source": [ "If you get a NaN for Coffee in the above cell, just add .fillna(0) at the end. (A NaN can appear when a category contains only a single drink, because the sample standard deviation is undefined for one value.) To read more about fillna - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html\n", "\n", "Otherwise, skip the cell below." ] }, { "cell_type": "code", "execution_count": null, "id": "636739b8", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here.fillna(0)" ] }, { "cell_type": "markdown", "id": "cb739088", "metadata": {}, "source": [ "Now let's incorporate this info into the bar chart as well. 
We want a bar chart with one bar for each beverage category, where the height of each bar is the average calories and the error bars represent +/- one sample standard deviation.\n", "\n", "Hint : go back to https://benalexkeen.com/bar-charts-in-matplotlib/" ] }, { "cell_type": "code", "execution_count": null, "id": "eb8dd771", "metadata": { "scrolled": true }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "36f0d0f8", "metadata": {}, "source": [ "Look how easy it is to understand that many numbers when they are visualised well!\n", "\n", "Awesome work so far!! " ] }, { "cell_type": "markdown", "id": "cbe67eb1", "metadata": {}, "source": [ "## Problem 4 : Scatter plot (2 points)" ] }, { "cell_type": "markdown", "id": "13f784c2", "metadata": {}, "source": [ "Now another friend of yours, who absolutely loves caffeine, comes to you for a recommendation. They want to know which Starbucks drinks have the most caffeine. They would also like to know how much sugar each of those drinks has, since they are trying to cut down on sugar. They don't like numbers much, so we want to present this to them in an attractive way. Let's start by sorting the DataFrame based on caffeine.\n", "\n", "Hint: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html" ] }, { "cell_type": "code", "execution_count": null, "id": "94cf325b", "metadata": { "scrolled": false }, "outputs": [], "source": [ "#insert your code here" ] }, { "cell_type": "markdown", "id": "e6fc7572", "metadata": {}, "source": [ "What are the top 10 **drinks** with the most caffeine in them? \n", "\n", "Hint : Remember .head() from earlier. Use that." 
] }, { "cell_type": "code", "execution_count": null, "id": "85cde918", "metadata": { "scrolled": false }, "outputs": [], "source": [ "top10_Caf_Drink = #insert your code here\n", "top10_Caf_Drink" ] }, { "cell_type": "markdown", "id": "a535cfb2", "metadata": {}, "source": [ "We don't really care about the other nutritional information at this point. 
Let's just print what is needed." ] }, { "cell_type": "code", "execution_count": null, "id": "082e15ae", "metadata": { "scrolled": true }, "outputs": [], "source": [ "top10_Caf_Drink[['Beverage','Sugars (g)','Caffeine (mg)']]" ] }, { "cell_type": "markdown", "id": "b4a6f681", "metadata": {}, "source": [ "Oops, why does the same drink keep repeating but with different sugar and caffeine values? Give yourself a minute before reading the next line for the answer.\n", "\n", "Yes, they are prepared differently. Let's add that too, since it is relevant information." ] }, { "cell_type": "code", "execution_count": null, "id": "87958d5f", "metadata": { "scrolled": true }, "outputs": [], "source": [ "top10_Caf_Drink[['Beverage','Beverage_prep','Sugars (g)','Caffeine (mg)']]" ] }, { "cell_type": "markdown", "id": "448c10c3", "metadata": {}, "source": [ "Now that we have the beverages with their prep, sugar, and caffeine, we need to show this to our friend. Let's plot them as a neat scatter plot, with caffeine on the x-axis and sugar on the y-axis." 
] }, { "cell_type": "code", "execution_count": null, "id": "bd1dc1e8", "metadata": { "scrolled": true }, "outputs": [], "source": [ "x = #Caffeine values as a list (e.g. via .to_list()) so that x[i] below indexes by position\n", "y = #Sugar values as a list\n", "\n", "beverages = top10_Caf_Drink['Beverage'].to_list()\n", "beverage_prep = top10_Caf_Drink['Beverage_prep'].to_list()\n", "labels = [ str(beverages[i]) + ' with ' + str(beverage_prep[i]) for i in range(len(top10_Caf_Drink))]\n", "\n", "#insert your code here to plot the graph\n", "\n", "plt.xlabel(\"Caffeine (mg)\")\n", "plt.ylabel(\"Sugar (g)\")\n", "plt.title(\"Sugar in the top 10 most caffeinated drinks\")\n", "\n", "axes = plt.gca()\n", "axes.set_xlim([0,450])\n", "\n", "for i, txt in enumerate(labels):\n", " plt.annotate(txt, (x[i], y[i]))\n", " \n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "64a9ccb2", "metadata": {}, "source": [ "Nice work!!\n", "\n", "\n", "# Conclusion\n", "In this assignment, we were able to download a dataset, load it as a pandas dataframe, explore the dataset with basic statistical functions, and visualise many specific examples to answer relevant questions about the data.\n", "\n", "Congratulations!!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }