[Explaining] NumPy and PyPlot

NumPy is a foundational library for data science in Python. It excels at handling multidimensional arrays and matrices, and offers a rich collection of mathematical functions that operate on them. That makes it an efficient tool for data manipulation, calculations, and extracting insights from raw data. NumPy also serves as the foundation for other popular data science libraries such as Pandas.

To bring those insights to light, in this post I’ll combine NumPy with Matplotlib, a visualization library for creating plots, histograms, and other graphical representations of data. By pairing NumPy’s calculations with Matplotlib’s visualizations, we can get a clearer picture of the patterns and trends hidden in our data.

Background Story

Let’s dive into the data and see if our assumption holds true! In this tour, we’ll compare the body measurements of adult males and females. While it’s commonly believed that men tend to weigh more, we’ll use data analysis to check this. Alright, first things first! Let’s collect some real data. The realer, the better! For the experiment, I asked my co-scientist, Gemini, to recruit some volunteers (with informed consent, of course) and measure their weight and height.

men’s list 👨

Rank | Name | Height (cm) | Weight (kg)
1 | John Smith | 175 | 81.5
2 | David Lee | 180 | 95.4
3 | Michael Kim | 172 | 77.1
4 | Ryan Jones | 178 | 75.22
5 | Charles Lee | 176 | 99.25
6 | William Chen | 182 | 80.15
7 | Andrew Brown | 170 | 78
8 | Daniel Miller | 179 | 89.1
9 | Kevin Garcia | 174 | 83.8
10 | Thomas Hall | 181 | 90.3
11 | Lev Lewinski | 173 | 69.3

women’s list 👩

Rank | Name | Height (cm) | Weight (kg)
1 | Alice Brown | 170 | 69.4
2 | Beatrice Lee | 162 | 70.88
3 | Chloe Garcia | 183 | 75.6
4 | Diana Miller | 178 | 68.55
5 | Emily Chen | 165 | 58.9
6 | Fiona Jones | 188 | 100.5
7 | Gloria Kim | 175 | 60.43
8 | Hannah Smith | 168 | 52.3
9 | Isabella Lee | 180 | 72.52
10 | Olivia Hall | 172 | 65.23
11 | Sarit Haddad | 168 | 79.87

Technical Analysis

We’ll start by bringing in NumPy, referred to as np for convenience. Then we can create three separate arrays (names, heights, and weights) for each sex. Our goal is to analyze BMI (body mass index), which is:

A measure devised by a statistician in the 19th century that relates weight to height: weight in kilograms divided by the square of height in meters.

After creating a BMI table for each sex, we will calculate the mean, median, and then compare the groups.
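
As a quick sanity check of the formula before applying it to whole arrays, here is a minimal sketch in plain Python. The helper name `bmi` is only for illustration, and the numbers are John Smith’s row from the men’s list, with his height converted from centimeters to meters.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


# John Smith from the men's list: 175 cm -> 1.75 m, 81.5 kg
john_bmi = bmi(81.5, 1.75)
print(round(john_bmi, 2))
```

The result should land at roughly 26.6. Forgetting the centimeter-to-meter conversion would shrink it by a factor of 10,000, which is why the height arrays in this post are written in meters.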

import numpy as np
# Data for males
m_names = np.array(["John Smith", "David Lee", "Michael Kim", "Ryan Jones",
                    "Charles Lee", "William Chen", "Andrew Brown", "Daniel Miller",
                    "Kevin Garcia", "Thomas Hall", "Lev Lewinski"])
m_heights = np.array([1.75, 1.80, 1.72, 1.78, 1.76, 1.82, 1.70, 1.79, 1.74, 1.81, 1.73])  # in meters, not cm
m_weights = np.array([81.5, 95.4, 77.1, 75.22, 99.25, 80.15, 78, 89.1, 83.8, 90.3, 69.3])

# Data for females
f_names = np.array(["Alice Brown", "Beatrice Lee", "Chloe Garcia", "Diana Miller", 
                    "Emily Chen", "Fiona Jones", "Gloria Kim", "Hannah Smith", 
                    "Isabella Lee", "Olivia Hall", "Sarit Haddad"])
f_heights = np.array([1.70, 1.62, 1.83, 1.78, 1.65, 1.88, 1.75, 1.68, 1.80, 1.72, 1.68])
f_weights = np.array([69.4, 70.88, 75.6, 68.55, 58.9, 100.5, 60.43, 52.3, 72.52, 65.23, 79.87])

Alright, with our data wrangled into arrays, let’s get down to calculating BMI! Just to satisfy my curiosity, I’ll print out the calculation results in a neat and organized way for each group.

# Calculating BMI for each group
# The division is elementwise, so the result is already a NumPy array
male_bmi = m_weights / (m_heights**2)
female_bmi = f_weights / (f_heights**2)

print("Males BMI:", male_bmi, "\nFemales BMI:", female_bmi)

Output:
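
Raw arrays can be hard to read, so here is one possible way to print a name next to each BMI value. This is only a sketch: it uses a three-person subset of the men’s list to stay short, and the variable names (`names`, `heights`, `weights`, `lines`) are chosen for this example rather than taken from the code above.

```python
import numpy as np

# A three-person subset of the men's list, to keep the example short
names = np.array(["John Smith", "David Lee", "Michael Kim"])
heights = np.array([1.75, 1.80, 1.72])  # meters
weights = np.array([81.5, 95.4, 77.1])  # kilograms

# Elementwise: each weight is divided by the square of the matching height
bmi = weights / heights ** 2

# One aligned line per person, BMI rounded to one decimal place
lines = [f"{name:<12} {value:5.1f}" for name, value in zip(names, bmi)]
print("\n".join(lines))
```

The `<12` in the format string pads each name to 12 characters so the numbers line up in a column.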

If you prefer to work with sorted data, the following code snippet can help you achieve that:

# First, merge the data into one record array per sex
# (np.rec is the public location of fromarrays; np.core is a private module)
male_data = np.rec.fromarrays([m_names, m_heights, m_weights, male_bmi], names='name, height, weight, bmi')
female_data = np.rec.fromarrays([f_names, f_heights, f_weights, female_bmi], names='name, height, weight, bmi')

# Sort the structured array based on the 'bmi' field
male_srtd = male_data[male_data['bmi'].argsort()]
female_srtd = female_data[female_data['bmi'].argsort()]
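
If `argsort` is new to you: it returns the positions that would put an array in ascending order, and indexing with those positions reorders the data. Here is a small sketch of the idea with plain arrays and three rows from the men’s list. The names `order`, `sorted_names`, and `descending_names` are my own for this example.

```python
import numpy as np

names = np.array(["John Smith", "David Lee", "Michael Kim"])
heights = np.array([1.75, 1.80, 1.72])  # meters
weights = np.array([81.5, 95.4, 77.1])  # kilograms
bmi = weights / heights ** 2

# Indices that would sort the BMI values from lowest to highest
order = bmi.argsort()

# Apply the same index array to every column so rows stay aligned
sorted_names = names[order]
sorted_bmi = bmi[order]

# Reversing the index array gives highest-to-lowest instead
descending_names = names[order[::-1]]
print(sorted_names, descending_names)
```

Because the same index array is applied to names and BMI values, each person stays attached to their own number after sorting.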

Now that the data is sorted by BMI, let’s draw a histogram to visualize it:

import matplotlib.pyplot as plt

plt.hist([male_bmi, female_bmi], label=['Male BMI', 'Female BMI'])
plt.xlabel('BMI')
plt.ylabel('Number of People')
plt.title('Comparison of Male and Female BMI Distribution')
plt.legend()
plt.grid(True)

# Display the histogram
plt.show()

Resulting histogram:
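
The bin counts behind a histogram can also be inspected as plain numbers with `np.histogram`, which is handy when no plot window is available. The sketch below uses the first five men from the list, and `bins=3` is an arbitrary choice for such a tiny sample, not a recommendation.

```python
import numpy as np

# First five men from the list
heights = np.array([1.75, 1.80, 1.72, 1.78, 1.76])  # meters
weights = np.array([81.5, 95.4, 77.1, 75.22, 99.25])  # kilograms
bmi = weights / heights ** 2

# counts[i] is the number of values that fall between edges[i] and edges[i + 1]
counts, edges = np.histogram(bmi, bins=3)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:.1f} to {right:.1f}: {count} people")
```

There is always one more edge than there are bins, and the counts add up to the number of people in the sample.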

Based on our initial analysis, it appears that the women in our sample tend to have lower BMI values than the men. Now I’d like to look at two more crucial details: the median and the average.

# Average calculation
mens_mean = np.mean(male_bmi)
womens_mean = np.mean(female_bmi)

# Median Calculation
mens_median = np.median(male_bmi)
womens_median = np.median(female_bmi)

# Printing the results
print("Men's average is:", mens_mean, "\nWomen's average is:", womens_mean)

print("Men's median is:", mens_median, "\nWomen's median is:", womens_median)

Output:
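
Why bother with both statistics? The average reacts strongly to a single extreme value, while the median barely moves. As a rough illustration, the sketch below takes the women’s weights from the list and drops the heaviest entry (100.5 kg) to see how far each statistic shifts. The names `without_max`, `mean_shift`, and `median_shift` are introduced just for this example.

```python
import numpy as np

f_weights = np.array([69.4, 70.88, 75.6, 68.55, 58.9, 100.5,
                      60.43, 52.3, 72.52, 65.23, 79.87])  # kilograms

# Drop the single heaviest entry to see how each statistic reacts
without_max = np.delete(f_weights, f_weights.argmax())

mean_shift = np.mean(f_weights) - np.mean(without_max)
median_shift = np.median(f_weights) - np.median(without_max)

print(f"mean shift: {mean_shift:.2f} kg, median shift: {median_shift:.2f} kg")
```

In this sample the average moves by about 3 kg while the median moves by less than half a kilogram, which is why it helps to report both.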

How about using bar charts to take a look at this from another angle?

# Labels for the x-axis
labels = ['Men (Average)', 'Men (Median)', 'Women (Average)', 'Women (Median)']

# Data for the bar chart
data = [mens_mean, mens_median, womens_mean, womens_median]

# Create the bar chart
plt.figure(figsize=(8, 6))  # Adjust figure size as needed
plt.bar(labels, data, color=["skyblue", "skyblue", "red", "red"])
plt.xlabel('Gender and Statistic')
plt.ylabel('BMI')
plt.title('Comparison of Men and Women\'s BMI (Average and Median)')
plt.grid(axis='y')  # Grid lines only on y-axis
plt.show()

Conclusions

This process has been a valuable learning experience for me. It’s important to note that all the data used in this post is fictional. The names were generated with the assistance of Gemini, and the measurements were created specifically for educational purposes. I hope you learned something too. See you in the next post! 👋


Photo by Campaign Creators on Unsplash

I’m Dolev

I’m a 33-year-old dad of three and a Master’s student in Information Science.
After years in business development, I pivoted to data science—my true passion.
Now I’m a Senior Data Scientist at a mobile gaming startup.
This site is where I share my journey, projects, and insights.
