The primary data tools I use at work are SQL, SAS, and R.
Learning Python has long been on my to-do list, while somehow never rising to the top.
In January I determined that this would be the year for me to dip my toe in, download the program, and play around with it.
It never ceases to amaze me how challenging the initial experience with a new program can be.
Simple issues can send me on in-depth internet searches in which I never quite resolve the problem but rather uncover multiple workarounds that produce something close-ish to my end goal.
Hopefully, if I ignore the question long enough my skill level will eventually catch up, rendering the issue moot.
Below is my first attempt at completing a short analysis in Python.
This is not intended to be a shining example of my analysis skills, or really even my ability to follow very, very basic web tutorials.
Rather, it is a fully exposed example of what an initial step can look like when trying to learn a new skill.
It isn’t very glamorous, but it exists.
Pulling all of this together, no matter the quality, gives me something concrete to build on in the next attempt.
That is worth more to me than the actual product.
Data for this project came from the "Banana Quality" dataset posted to Kaggle by l3LlFF. I downloaded the data on March 11th, 2024.
First, I pulled the basic information about the columns available in the Banana dataset using the '.info()' command. I present it below in image form for better styling.
#import packages (io provides the text buffer used to capture the .info() output)
import io
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#load dataset
banana = pd.read_csv(r'banana_quality.csv')
#export table with column list in image format
buffer = io.StringIO()
banana.info(buf=buffer)
lines = buffer.getvalue().splitlines()
#lines[3] is the header row of the .info() output; lines[5:-2] are the per-column rows
df = (pd.DataFrame([x.split() for x in lines[5:-2]], columns=lines[3].split())
      .drop('Count',axis=1)
      .rename(columns={'Non-Null':'Non-Null Count'}))
plt.close('all')
plt.axis('off')
plt.table(cellText=df.values,colWidths = [0.25]*len(df.columns),
          rowLabels=df.index,
          colLabels=df.columns,
          cellLoc = 'center', rowLoc = 'center',
          loc='top')
plt.savefig('01_info.png',bbox_inches="tight", dpi=200)
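Parsing the printed '.info()' text works, but it depends on the exact layout of that printout. A possible alternative, sketched here under assumptions, is to build the same column summary straight from the DataFrame. The `sample` frame and the `column_summary` helper are hypothetical names invented for this illustration, and the values are made up rather than taken from the banana dataset.

```python
import pandas as pd

# Small made-up stand-in for the banana data (NOT the real Kaggle values)
sample = pd.DataFrame({
    "Size": [1.2, -0.4, None],
    "Sweetness": [0.5, 2.1, -1.3],
    "Quality": ["Good", "Bad", "Good"],
})

def column_summary(frame):
    """Hypothetical helper: one row per column with its non-null count and dtype."""
    return pd.DataFrame({
        "Column": frame.columns,
        "Non-Null Count": frame.notna().sum().to_numpy(),
        "Dtype": frame.dtypes.astype(str).to_numpy(),
    })

summary = column_summary(sample)
print(summary)
```

The resulting `summary` could be handed to `plt.table` the same way `df` is above, without relying on line positions in the text output.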
Next, I pulled some more detailed information about the dataset using the '.describe()' command.
#export table with column description in image format
df2 = banana.describe().transpose()
plt.close('all')
plt.axis('off')
plt.table(cellText=df2.values,colWidths = [0.25]*len(df2.columns),
          rowLabels=df2.index,
          colLabels=df2.columns,
          cellLoc = 'center', rowLoc = 'center',
          loc='top')
plt.savefig('02_describe.png',bbox_inches="tight", dpi=150)
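One thing that may help the table image read more cleanly is rounding the statistics before they are drawn, since '.describe()' returns many decimal places. The sketch below shows the idea on a tiny made-up frame; `sample` and `stats` are names assumed for this example, not part of the original analysis.

```python
import pandas as pd

# Made-up numbers for illustration only (NOT the real banana data)
sample = pd.DataFrame({
    "Size": [1.0, 2.0, 3.0, 4.0],
    "Sweetness": [0.5, 1.5, 2.5, 3.5],
})

# Same describe-and-transpose step, then round every statistic to two decimals
stats = sample.describe().transpose().round(2)
print(stats)
```

Passing `stats.values` to `plt.table` in place of the unrounded values should keep each cell short enough to fit its column.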
I reviewed the correlations of all of the numeric variables in the dataset.
#Build a correlation matrix with numeric variables
banana_narrow = banana[['Size', 'Weight', 'Sweetness', 'Softness',
'HarvestTime', 'Ripeness', 'Acidity']]
matrix = banana_narrow.corr()
plt.close('all')
plt.axis('off')
plt.table(cellText=matrix.values,colWidths = [0.25]*len(matrix.columns),
          rowLabels=matrix.index,
          colLabels=matrix.columns,
          cellLoc = 'center', rowLoc = 'center',
          loc='top')
plt.savefig('03_correlation.png',bbox_inches="tight", dpi=150)
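Typing out the numeric column names by hand works for a small dataset. As a hedged alternative, pandas can pick the numeric columns itself with `select_dtypes`. The frame below is a made-up stand-in (`sample`, `numeric_only`, and `corr` are names assumed for this sketch), built so the correlations are easy to check by eye.

```python
import pandas as pd

# Made-up stand-in data (NOT the real banana values)
sample = pd.DataFrame({
    "Size": [1.0, 2.0, 3.0, 4.0],
    "Weight": [2.0, 4.0, 6.0, 8.0],      # rises in step with Size
    "Sweetness": [4.0, 3.0, 2.0, 1.0],   # falls as Size rises
    "Quality": ["Good", "Bad", "Good", "Bad"],
})

# Keep only numeric columns, so the text column 'Quality' is left out
numeric_only = sample.select_dtypes(include="number")

# Pairwise Pearson correlations between the remaining columns
corr = numeric_only.corr()
print(corr)
```

Here Size and Weight move together perfectly and Size and Sweetness move in opposite directions, so those two entries come out at the extremes of the correlation scale.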
As an alternative view of the same numbers, I plotted the correlation strengths as a heatmap.
#Export a correlation plot
alpha = ['Size', 'Weight', 'Sweetness', 'Softness',
'HarvestTime', 'Ripeness', 'Acidity']
plt.close('all')
figure = plt.figure()
axes = figure.add_subplot(111)
caxes = axes.matshow(matrix, interpolation ='nearest')
figure.colorbar(caxes)
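#pin one tick per variable so each label lines up with its row/column in the matrix
axes.set_xticks(range(len(alpha)))
axes.set_yticks(range(len(alpha)))
axes.set_xticklabels(alpha, rotation = 50)
axes.set_yticklabels(alpha, rotation = 50)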
title_corr = 'Correlation Heatmap of Banana Characteristics'
plt.title( title_corr )
plt.savefig('04_heat.png',bbox_inches="tight", dpi=150)
Finally, I compared the sweetness rating by the quality (good/bad) of the bananas using a box plot.
#Export a boxplot of sweetness and quality
plt.close('all')
banana.boxplot(column='Sweetness', by='Quality', patch_artist=True,
               boxprops = dict(facecolor = "darkgrey"),
               medianprops = dict(color = "yellow", linewidth = 2))
title_boxplot = 'Sweetness Rating by Quality'
plt.title( title_boxplot )
plt.suptitle('')
plt.grid(False)
plt.savefig('05_boxplot.png',bbox_inches="tight", dpi=150)
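A box plot shows the comparison visually, and a small grouped summary can put numbers next to it. The sketch below assumes the same two column names, 'Sweetness' and 'Quality', but runs on made-up values so it stands alone; `sample` and `by_quality` are names invented for this example.

```python
import pandas as pd

# Made-up values for illustration only (NOT the real banana data)
sample = pd.DataFrame({
    "Sweetness": [1.0, 3.0, 5.0, -2.0, 0.0, 2.0],
    "Quality": ["Good", "Good", "Good", "Bad", "Bad", "Bad"],
})

# Count, median, and mean of Sweetness within each Quality group
by_quality = sample.groupby("Quality")["Sweetness"].agg(["count", "median", "mean"])
print(by_quality)
```

The median column corresponds to the line drawn inside each box in the plot, which makes it a quick cross-check on the figure.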
All in all, this short project showed bunches of promise but ended up being a tough banana to peel.
I learned a little about Python and more about intro-level HTML coding.
Looking forward to the next ripe analysis ready to be picked.