Analyzing Honey Production: EDA with Python
- Background
- Objectives
- Importing the necessary packages
- Data Loading & Overview
- Exploring The Data
- Count of Colonies
- Spread of yield per colony
- Distribution of total production of honey
- Distribution of cost of honey per pound
- Distribution of value of production
- Distribution of stocks held by producers
- How Numerical Values Relate to one another
- Numerical Data Correlations
- Trends by state and year
- Most & Least Honey Production by State
- Similarly, we can check which state is producing the costliest and cheapest honey
- YoY Honey Production Trend
- Variation in the number of colonies over the years
- YoY yield per colony
- YoY Production By State
- YoY number of colonies and yield per colony in 5 prominent states
- Effect of the declining production trend on the value of production
- Comparing the total production with the stocks held by the producers
- Average price per pound of honey across states
- Conclusions
Background
Honeybee Population Declining
In 2006, the decline in the honeybee population was becoming a concern, as honeybees have an integral place in American honey agriculture.
A Recognized Disorder
Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon of disappearing worker bees that causes the remaining hive colonies to collapse. Speculation on the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. The U.S. previously produced more than half the honey it consumed per year. Since then, honey has become primarily imported, with 350 of the 400 million pounds of honey consumed every year originating from imports.
Investigating The Data
This dataset provides insight into honey production supply and demand in America from 1998 to 2016.
Objectives
To visualize how honey production has changed over the years (1998–2016) in the United States.
Key questions to be answered:
- How has honey production yield changed from 1998 to 2016?
- Over time, what have been the major production trends across the states?
- Is there any pattern that can be observed between total honey production and the value of production every year? How has the value of production, which in some sense could be tied to demand, changed every year?
In [1]:
# NOTE: for codelab cloud env
# !pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 -q --user
In [2]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Command to tell Python to actually display the graphs
%matplotlib inline
# To suppress scientific notation when displaying numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)
In [3]:
honeyprod = pd.read_csv("honeyproduction1998-2016.csv")
In [4]:
honeyprod.head()
Out [4]:
- state: Various states in the U.S.
- year: Year of production
- stocks: Refers to stocks held by producers. Unit is pounds
- numcol: Number of honey-producing colonies. Honey producing colonies are the maximum number of colonies from which honey was taken during the year. It is possible to take honey from colonies that did not survive the entire year
- yieldpercol: honey yield per colony. The unit is in pounds
- totalprod: Total production (numcol x yieldpercol). Unit is pounds
- priceperlb: Refers to average price per pound based on expanded sales. The unit is dollars.
- prodvalue: Value of production (totalprod x priceperlb). The unit is dollars.
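The two derived columns can be checked against their definitions. The sketch below is a minimal illustration on a hypothetical two-row frame with made-up values (not rows from the real file); on the real data, small gaps from rounding would be expected.

```python
import pandas as pd

# Hypothetical rows in the same layout as the dataset (values are made up).
sample = pd.DataFrame({
    "state": ["A", "B"],
    "numcol": [10000, 5000],
    "yieldpercol": [60, 80],
    "totalprod": [600000, 400000],
    "priceperlb": [1.50, 2.00],
    "prodvalue": [900000.0, 800000.0],
})

# Recompute the derived columns from their definitions.
recomputed_prod = sample["numcol"] * sample["yieldpercol"]
recomputed_value = sample["totalprod"] * sample["priceperlb"]

# Relative gap between the reported and the recomputed values.
prod_gap = (sample["totalprod"] - recomputed_prod).abs() / recomputed_prod
value_gap = (sample["prodvalue"] - recomputed_value).abs() / recomputed_value
print(prod_gap.max(), value_gap.max())
```

Running the same two comparisons on `honeyprod` would show how closely the published columns follow the stated formulas.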
In [5]:
dataShape = honeyprod.shape
print(f'rows: {dataShape[0]}\ncols: {dataShape[1]}')
In [6]:
honeyprod.info()
- Only one column has the object datatype; the other 7 columns are numerical
- All the columns have 785 observations, which means none of the columns has null values
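Reading the non-null counts from `info()` works, but the check can also be made explicit. A small sketch, using a hypothetical frame with one deliberately missing value:

```python
import pandas as pd

# Hypothetical frame with one missing colony count (values are made up).
sample = pd.DataFrame({
    "state": ["A", "B", "C"],
    "numcol": [10000, None, 7000],
})

# Number of missing values per column; all zeros means no nulls.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

On the real data, `honeyprod.isnull().sum()` should return zero for every column if the reading of `info()` above is right.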
In [7]:
honeyprod.describe()
Out [7]:
- The number of colonies per state is spread over a huge range, from 2000 to 510000
- The average number of colonies is close to the 75th percentile of the data, indicating a right skew
- The standard deviation of the numcol column is very high
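The skew argument above compares the mean with the quartiles. As a rough sketch on made-up colony counts (one very large value among several small ones), the same comparison can be done numerically, along with pandas' own skewness estimate:

```python
import pandas as pd

# Hypothetical colony counts with one very large state (values are made up).
numcol = pd.Series([2000, 5000, 8000, 12000, 20000, 30000, 510000])

summary = {
    "mean": numcol.mean(),
    "median": numcol.median(),
    "q75": numcol.quantile(0.75),
    "skew": numcol.skew(),  # positive values indicate a right skew
}
print(summary)
```

A mean that sits well above the median, together with a positive `skew()`, points to a long right tail.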
In [8]:
plt.figure(figsize=(15, 7))
sns.histplot(data= honeyprod, x= "numcol", kde= True);
In [9]:
sns.boxplot(data = honeyprod, x = 'numcol');
Observations
- Most of the data is concentrated within the range of 0-50000, which means most states have up to about 50000 honey-producing colonies
- The distribution is right-skewed with a lot of outliers towards the higher end
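The points a boxplot draws as outliers can also be listed directly. The helper below (`iqr_outliers` is a hypothetical name) is a sketch of the usual 1.5 × IQR rule, applied to made-up colony counts:

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical colony counts (values are made up).
numcol = pd.Series([2000, 5000, 8000, 12000, 20000, 30000, 510000])
outliers = iqr_outliers(numcol)
print(outliers.tolist())
```

Calling `iqr_outliers(honeyprod["numcol"])` would give the count and the size of the high-end outliers instead of an impression from the plot.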
In [10]:
plt.figure(figsize=(15, 7))
sns.histplot(data= honeyprod, x= "yieldpercol", kde= True);
In [11]:
sns.boxplot(data = honeyprod, x = 'yieldpercol');
Observations
- The bulk of the distribution looks fairly even, with only a little skewness
- Overall, yield per colony has a right-skewed distribution with a lot of outliers towards the higher end
- The median yield per colony is close to 60 pounds
In [12]:
plt.figure(figsize=(15, 7))
sns.histplot(data= honeyprod, x= "totalprod", kde= True);
In [13]:
sns.boxplot(data = honeyprod, x = 'totalprod');
Observations
- Total production has a right-skewed distribution with a lot of outliers towards the higher end
- The median of total production is nearly 0.1 in the plot's axis units, which are scaled by a power of ten rather than being single pounds
- Since total production is related to the number of colonies, its distribution closely resembles that of numcol
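Medians are easier to read from the data than from a plot axis, because matplotlib may label large axes with a power-of-ten multiplier. A minimal sketch on made-up production figures:

```python
import pandas as pd

# Hypothetical total production values in pounds (values are made up).
sample = pd.DataFrame({"totalprod": [400000, 900000, 1500000, 4000000, 33000000]})

median_lb = sample["totalprod"].median()
print(f"median totalprod: {median_lb:,.0f} lb ({median_lb / 1e6:.2f} million lb)")
```

The same call on `honeyprod["totalprod"]` gives the median in pounds without any axis scaling to interpret.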
In [14]:
plt.figure(figsize=(15, 7))
sns.histplot(data= honeyprod, x= "priceperlb", kde= True);
In [15]:
sns.boxplot(data = honeyprod, x = 'priceperlb');
Observations
- Most of the honey is priced between 0-2 dollars
- Price per pound of honey has a right-skewed distribution with a lot of outliers towards the higher end
- The median price per pound of honey is 1.5 dollars
In [16]:
plt.figure(figsize=(15, 7))
sns.histplot(data= honeyprod, x= "prodvalue", kde= True);
In [17]:
sns.boxplot(data = honeyprod, x = 'prodvalue');
Observations
- Production value has a right-skewed distribution with a lot of outliers towards the higher end
- The median production value is between 0 and 1 in the plot's axis units, which are scaled by a power of ten rather than being single dollars
In [18]:
plt.figure(figsize=(15, 7))
sns.histplot(data= honeyprod, x= "stocks", kde= True);
In [19]:
sns.boxplot(data = honeyprod, x = 'stocks');
Observations
- Stocks held by producers have a right-skewed distribution with a lot of outliers towards the higher end
- The median stocks held by producers is close to 0, which shows that the majority of producers hold very little stock themselves
In [20]:
sns.pairplot(honeyprod, diag_kind="kde");
In [21]:
# Removing the 'state' and 'year' columns
honeyprod_without_state = honeyprod.drop(columns=['state','year'])
# Calculate the correlation
correlation = honeyprod_without_state.corr() # 2-D matrix of pairwise correlation coefficients
correlation
Out [21]:
In [22]:
plt.figure(figsize=(15, 7))
sns.heatmap(correlation, annot=True, cmap="Spectral");
Observations
- Number of colonies has a high positive correlation with total production, stocks and the value of production. As expected, all these values are highly correlated with each other
- Yield per colony does not have a high correlation with any of the features that we have in our dataset
- The same is true of priceperlb
- Determining the factors influencing per colony yield and price per pound of honey would need further investigation
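The strongly correlated pairs can also be pulled out of the matrix rather than read off the heatmap. This is a sketch on made-up numbers; the 0.8 cutoff is an arbitrary assumption, not a threshold used in the analysis above.

```python
from itertools import combinations

import pandas as pd

# Hypothetical numeric columns (values are made up).
sample = pd.DataFrame({
    "numcol": [10000, 20000, 30000, 40000, 50000],
    "totalprod": [650000, 1150000, 1900000, 2300000, 3100000],
    "yieldpercol": [65, 58, 63, 57, 62],
})

corr = sample.corr()

# Visit each pair of columns once and keep the strongly correlated ones.
threshold = 0.8  # assumed cutoff for "high" correlation
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
strong = {pair: r for pair, r in pairs.items() if abs(r) >= threshold}
print(strong)
```

Applied to the `correlation` matrix computed earlier, the same loop would list which pairs exceed whatever cutoff is chosen.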
Similarly, we can explore the other two columns as well, i.e. state and year
Trends by state and year
Which states may be producing the most honey?
In [23]:
honeyprod['state'].unique()
Out [23]:
In [24]:
# How many unique states are making honey in this dataset
honeyprod['state'].nunique()
Out [24]:
In [25]:
top10_totalprod= honeyprod.groupby('state').sum()[['totalprod']].sort_values('totalprod', ascending=False).reset_index().head(10) #top 10 states producing maximum honey
top10_totalprod
Out [25]:
In [26]:
plt.figure(figsize=(20,6))
sns.catplot(data=top10_totalprod,x= 'state', y='totalprod', kind='bar')
plt.title('Top 10 Honey Production States')
plt.xticks(rotation=90);
In [27]:
# Bottom 10 states by total honey production
bottom10_totalprod= honeyprod.groupby('state').sum()[['totalprod']].sort_values('totalprod', ascending=False).reset_index().tail(10)
bottom10_totalprod
Out [27]:
In [28]:
plt.figure(figsize=(20,6))
sns.catplot(data=bottom10_totalprod,x= 'state', y='totalprod', kind='bar');
plt.title('Least 10 Honey Production States')
plt.xticks(rotation=90);
Observations
- North Dakota is producing the maximum amount of honey, followed by California and South Dakota
- Oklahoma is producing the least amount of honey in total, followed by Maryland and South Carolina
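The same ranking can be written more compactly with `nlargest` and `nsmallest`. The sketch below uses a made-up frame in which state "C" reports only one year, to show that a cumulative total also depends on how many years a state appears in the data.

```python
import pandas as pd

# Hypothetical production records (values are made up).
sample = pd.DataFrame({
    "state": ["A", "A", "B", "B", "C"],
    "year": [1998, 1999, 1998, 1999, 1998],
    "totalprod": [100, 120, 300, 280, 50],
})

# Cumulative production per state over all reported years.
totals = sample.groupby("state")["totalprod"].sum()
top = totals.nlargest(2)
bottom = totals.nsmallest(2)

# Number of years each state reports; fewer years means a smaller total.
years_reported = sample.groupby("state")["year"].nunique()
print(top, bottom, years_reported, sep="\n")
```

Checking `years_reported` for the real data would show whether the lowest-ranked states are small producers or simply have fewer rows.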
In [29]:
# Top 10 states by price per pound, summed over the years (a sum favors states with more reported years)
top10_price= honeyprod.groupby('state').sum()[['priceperlb']].sort_values('priceperlb', ascending=False).reset_index().head(10)
top10_price
Out [29]:
In [30]:
plt.figure(figsize=(20,6))
sns.catplot(data=top10_price,x= 'state', y='priceperlb', kind='bar')
plt.title('Most Expensive Honey Production States')
plt.xticks(rotation=90);
In [31]:
# Bottom 10 states by price per pound, summed over the years
bottom10_price= honeyprod.groupby('state').sum()[['priceperlb']].sort_values('priceperlb', ascending=False).reset_index().tail(10)
bottom10_price
Out [31]:
In [32]:
plt.figure(figsize=(20,6))
sns.catplot(data=bottom10_price,x= 'state', y='priceperlb', kind='bar')
plt.title('Least Expensive Honey Production States')
plt.xticks(rotation=90);
Observations
- Virginia is producing the costliest honey, followed by Illinois and North Carolina
- Oklahoma is producing the cheapest honey, followed by Maryland and South Carolina
- Note that these rankings sum priceperlb over the years, so states with fewer reported years tend to rank lower
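Because the rankings above add up priceperlb across years, a state's position depends partly on how many years it reports. A hedged alternative is to compare the mean price per state. The sketch below uses made-up prices where the two aggregations disagree:

```python
import pandas as pd

# Hypothetical prices: state "A" reports three years, state "B" only one
# (values are made up).
sample = pd.DataFrame({
    "state": ["A", "A", "A", "B"],
    "priceperlb": [1.0, 1.2, 1.4, 3.0],
})

# Sum, mean and number of records per state, side by side.
by_state = sample.groupby("state")["priceperlb"].agg(["sum", "mean", "count"])
print(by_state)
```

Here the sum ranks "A" first while the mean ranks "B" first, so it is worth checking whether the real ranking changes when `mean` replaces `sum`.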
In [33]:
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, errorbar=None)
plt.title('Year-Over-Year Honey Production')
plt.xticks(rotation=90);
Observations
- The overall honey production in the US has been decreasing over the years
- Total honey production = number of colonies * average yield per colony. Let's check if the honey production is decreasing due to one of these factors or both
In [34]:
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='numcol', data=honeyprod, errorbar=None, estimator=sum)
plt.title('Year-Over-Year Number Of Honey Colonies')
plt.xticks(rotation=90);
Observations
- The number of colonies across the country shows a declining trend from 1998-2008 but has seen an uptick after 2008
- It is possible that there was some intervention in 2008 that helped increase the number of honey bee colonies across the country
In [35]:
plt.figure(figsize=(15, 7))
sns.pointplot(x='year', y='yieldpercol', data=honeyprod, estimator=sum, errorbar=None)
plt.xticks(rotation=90);
Observations
- In contrast to the number of colonies, the yield per colony has been decreasing since 1998
- This indicates that it is not the number of colonies that is causing a decline in total honey production but the yield per colony
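The plot above adds up yieldpercol across states, which weights every state equally. One way to cross-check the decomposition is a colony-weighted national yield: yearly total production divided by yearly total colonies. This is a sketch on made-up numbers, and `national_yield` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical state-level records for two years (values are made up).
sample = pd.DataFrame({
    "year": [1998, 1998, 2016, 2016],
    "state": ["A", "B", "A", "B"],
    "numcol": [1000, 3000, 1500, 3000],
    "yieldpercol": [80, 60, 50, 40],
})
sample["totalprod"] = sample["numcol"] * sample["yieldpercol"]

# Yearly national totals, then yield as production per colony.
yearly = sample.groupby("year")[["numcol", "totalprod"]].sum()
yearly["national_yield"] = yearly["totalprod"] / yearly["numcol"]
print(yearly)
```

In this made-up example colonies rise while production falls, which can only happen if the yield per colony drops; the same table for `honeyprod` would show which factor drives the real trend.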
In [36]:
# Add hue parameter to the pointplot to plot for each state
plt.figure(figsize=(15, 7)) # To resize the plot
sns.pointplot(x='year', y='totalprod', data=honeyprod, estimator=sum, errorbar=None, hue = 'state')
plt.title('Year-Over-Year Production By State')
plt.legend(bbox_to_anchor=(1, 1))
plt.xticks(rotation=90);
Observations
- Some states have much higher production than the others, but this plot is a little hard to read
- Let's try plotting each state separately for a better understanding
Individual State Production Small Multiples
In [37]:
sns.catplot(
    x='year',
    y='totalprod',
    data=honeyprod,
    estimator=sum,
    col='state',
    kind="point",
    col_wrap=5,
);
Observations
- The most prominent honey-producing states of the US are California, Florida, North Dakota, South Dakota and Montana
- Unfortunately, the honey production in California has seen a steep decline over the years
- Florida's total production also has been on a decline
- South Dakota has more or less maintained its levels of production
- North Dakota has actually seen an impressive increase in the honey production
In [38]:
cplot1=sns.catplot(x='year', y='numcol',
data=honeyprod[honeyprod["state"].isin(["North Dakota","California","South Dakota","Florida","Montana"])],
estimator=sum, col='state', kind="point",
col_wrap = 5)
cplot1.set_xticklabels(rotation=90);
In [39]:
cplot2=sns.catplot(x='year', y='yieldpercol',
data=honeyprod[honeyprod["state"].isin(["North Dakota","California","South Dakota","Florida","Montana"])],
estimator=sum, col='state', kind="point",
col_wrap = 5)
cplot2.set_xticklabels(rotation=90);
- In North Dakota, the number of colonies has increased significantly over the years as compared to the other 4 states
- If we check the yield per colony, it has been in an overall decreasing trend for all the 5 states over the years
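The visual comparison can be backed by a simple first-to-last-year percentage change per state. The sketch below uses made-up figures for two of the five states plus one other state that the filter drops; `major` is a hypothetical name.

```python
import pandas as pd

# Hypothetical first-year and last-year records (values are made up).
sample = pd.DataFrame({
    "state": ["North Dakota", "North Dakota", "California", "California", "Ohio", "Ohio"],
    "year": [1998, 2016, 1998, 2016, 1998, 2016],
    "numcol": [200, 400, 400, 300, 10, 12],
    "yieldpercol": [100, 80, 80, 40, 50, 55],
})

# Keep only the states of interest, ordered by year.
major = ["North Dakota", "California"]
subset = sample[sample["state"].isin(major)].sort_values("year")

# Percentage change from the first to the last reported year, per state.
grouped = subset.groupby("state")[["numcol", "yieldpercol"]]
pct_change = (grouped.last() / grouped.first() - 1) * 100
print(pct_change.round(1))
```

A first-versus-last comparison is sensitive to unusual years at either end, so it complements the plots rather than replacing them.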
In [40]:
sns.pointplot(x="year", y="prodvalue", data=honeyprod, errorbar=None)
plt.xticks(rotation=90);
Observations
- This is an interesting trend: as the total production has declined over the years, the value of production per pound has increased over time
- As the supply declined, demand has added to the value of honey
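The plot above shows the average prodvalue per state and year. To express value per pound directly, one option is to divide each year's total value by that year's total production. This is a sketch on made-up numbers, and `value_per_lb` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical state-level records for two years (values are made up).
sample = pd.DataFrame({
    "year": [1998, 1998, 2016, 2016],
    "totalprod": [1000, 3000, 800, 2200],
    "prodvalue": [700.0, 2100.0, 1600.0, 4400.0],
})

# Yearly totals, then dollars of production value per pound produced.
yearly = sample.groupby("year")[["totalprod", "prodvalue"]].sum()
yearly["value_per_lb"] = yearly["prodvalue"] / yearly["totalprod"]
print(yearly)
```

Plotting `value_per_lb` next to `totalprod` for the real data would show the supply and price movements on comparable terms.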
In [41]:
plt.figure(figsize = (20,15)) # To resize the plot
# Plot total production per state
sns.stripplot(x="state", y="totalprod", data=honeyprod.sort_values("totalprod", ascending=False),
color="b", jitter=True)
plt.xticks(rotation=90);
In [42]:
plt.figure(figsize = (20,15)) # To resize the plot
# Plot stocks per state
sns.stripplot(x="state", y="stocks", data=honeyprod.sort_values("totalprod", ascending=False),
color="r", jitter=True)
plt.xticks(rotation=90);
Observations
- North Dakota has been able to sell more honey than South Dakota, despite having the highest production value
- Florida has the highest efficiency among the major honey-producing states
- Michigan is more efficient than Wisconsin in selling honey
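"Efficiency" is not defined numerically above. One possible reading, taken here as an assumption, is the share of production still held as stocks: the lower the share, the more of the honey has been sold. The sketch below uses made-up figures, and `stock_share` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical production and stocks per state and year (values are made up).
sample = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "totalprod": [1000, 1000, 500, 500],
    "stocks": [100, 300, 250, 150],
})

# Share of cumulative production that is held as stocks, per state.
by_state = sample.groupby("state")[["totalprod", "stocks"]].sum()
by_state["stock_share"] = by_state["stocks"] / by_state["totalprod"]

# Lowest share first: these states hold back the least of what they produce.
ranked = by_state.sort_values("stock_share")
print(ranked)
```

Ranking the real states this way would put numbers behind the comparison of the two strip plots.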
In [43]:
plt.figure(figsize=(25, 7)) # To resize the plot
sns.swarmplot(data = honeyprod, x = "state", y = "priceperlb")
plt.xticks(rotation=90);
Observations
- Virginia has the highest price per pound of honey
- The average price per pound of honey in the major honey-producing states is towards the lower end
Conclusions
- The total honey production has declined over the years
- The production value per pound has increased
- The reason for the declined production seems to be the decrease in the yield per colony
- Top honey-producing states are California, Florida, North Dakota, South Dakota and Montana
- Florida has been very efficient in selling honey