# Workshop 24. Visualization

In [ ]:
import pandas as pd

Out[ ]:
            division  level of education  training level  work experience  salary   sales
0  computer hardware        some college               1                5   81769  302611
1        peripherals   bachelor's degree               1                4   89792  274336
2        peripherals         high school               0                5   70797  256854
3    office supplies  associate's degree               1                5   82236  279598
4        peripherals        some college               1                5   73725  261014

# Matplotlib

In [ ]:
# Prepare Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:
# Change the size of the figures
from matplotlib import rcParams
rcParams['figure.figsize'] = 14, 9


### Simple plot

In [ ]:
# Arguments are x and y coordinates as python lists
xx = [1, 2, 3, 4]
yy = [2, 3, 5, 9]
plt.plot(xx, yy)

# Arguments are x and y coordinates as numpy arrays
import numpy as np
xx = np.linspace(-1, 2, 10)
yy = xx**2
plt.plot(xx, yy)

# If the first argument is missing, [0, 1, 2, 3, 4, ...] will be
# used automatically
plt.plot(yy)

Out[ ]:
[<matplotlib.lines.Line2D at 0x7f5b89c3b6a0>]

### Subplot

The function plt.subplot() lets you place several plots in the same figure.

plt.subplot(nrows, ncols, index) declares a grid of nrows by ncols plots and makes subplot number index the current one. Indices start at 1, run up to nrows*ncols, and are counted left to right, top to bottom.

In [ ]:
# Specify that you will have subplots:
# 2 rows, 1 column, now drawing in the first (top) one
plt.subplot(2, 1, 1)
xx = np.linspace(-1, 2, 10)
plt.plot(xx, np.sin(xx))

# If each argument is a single digit (fewer than 10 subplots),
# you can write all three arguments as one number.
# Now drawing in the second (bottom) one
plt.subplot(212)
plt.plot(xx, np.cos(xx))

Out[ ]:
[<matplotlib.lines.Line2D at 0x7f5b896fd430>]
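
The index numbering is easier to see on a larger grid. The following is a minimal sketch, assuming a 2 by 3 grid chosen only for illustration; it labels each subplot with its index and the row and column that index maps to.

```python
import matplotlib.pyplot as plt

nrows, ncols = 2, 3  # grid size picked only for this illustration
fig = plt.figure()
for index in range(1, nrows * ncols + 1):
    # Select subplot number `index` (1-based) as the current axes
    ax = plt.subplot(nrows, ncols, index)
    # 0-based row and column that this index corresponds to
    row, col = divmod(index - 1, ncols)
    ax.set_title(f"index={index} (row {row}, col {col})")
```

Indices 1 to 3 fill the top row and indices 4 to 6 fill the bottom row.
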
In [ ]:
# Arguments are data columns (series)
# The format string 's' means "don't draw lines, draw square markers"
plt.subplot(311)
plt.plot(df.salary, df.sales, 's')
# Exactly the same as above, but with circle markers
plt.subplot(312)
plt.plot(df['salary'], df['sales'], 'o')
# Arguments are one-column data frames, drawn with point markers
plt.subplot(313)
plt.plot(df[['salary']], df[['sales']], '.')

Out[ ]:
[<matplotlib.lines.Line2D at 0x7f5b891343d0>]

### Multiple columns in Y

If Y has several columns, each column is drawn as a separate series. One small drawback is that you have to assign the legend labels manually.

In [ ]:
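plt.subplot(211)
x = [1, 2, 3]
# Each column of y is plotted as its own series
y = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
plt.plot(x, y, 'o')
plt.subplot(212)
plt.plot(df[['salary']], df[['training level', 'work experience']], 'o')
# Labels are matched to the series in the order they were drawn
plt.legend(['training level', 'work experience'])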

Out[ ]:
<matplotlib.legend.Legend at 0x7f5b88c9da90>
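
One way to avoid typing the labels into plt.legend() by hand is to plot each column separately and pass label= to every call. The sketch below assumes a small made-up data frame (`demo`) that only mimics the workshop's column names; it is not the workshop data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows that only mimic the workshop's column names
demo = pd.DataFrame({
    "salary": [70000, 80000, 90000],
    "training level": [0, 1, 2],
    "work experience": [3, 5, 4],
})

fig, ax = plt.subplots()
for column in ["training level", "work experience"]:
    # label= attaches the column name to the series it describes
    ax.plot(demo["salary"], demo[column], "o", label=column)
# With labels attached, legend() needs no arguments
legend = ax.legend()
```

This keeps each label next to the data it describes, so the legend cannot drift out of order if columns are added or removed.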

### Histogram

In [ ]:
# Default histogram (10 bins)
plt.subplot(311)
plt.hist(df.salary)
# More bins: more detail is visible, but the histogram is noisier
plt.subplot(312)
plt.hist(df.salary, bins=30)
plt.ylabel('Number of appearances')
# density=True rescales the bars so that their total area is 1:
# each count is divided by the total count and by the bin width
plt.subplot(313)
plt.hist(df.salary, bins=30, density=True)
plt.ylabel('Density')

Out[ ]:
Text(0, 0.5, 'Density')
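
To see what density=True does to the numbers, the normalization can be checked with np.histogram, which applies the same rule. This is a minimal sketch on synthetic values (the `values` array is generated here and is not the workshop's salary column).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "salaries", used only to illustrate the normalization
values = rng.normal(loc=80000, scale=10000, size=1000)

counts, edges = np.histogram(values, bins=30)
density, _ = np.histogram(values, bins=30, density=True)
widths = np.diff(edges)

# Area of all bars: sum of height * width
total_area = np.sum(density * widths)
# The same normalization done by hand
manual_density = counts / (counts.sum() * widths)

print(total_area)
```

The bar heights are therefore not probabilities; only height times bin width sums to 1.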

You can also compute the histogram values with np.histogram() and then draw the bars yourself with plt.bar(positions, height).

In [ ]:
# Drawing a histogram manually

# compute histogram with numpy
# output is heights of bins (n values) and edges of bins (n+1 values)
height, edges = np.histogram(df.salary)
# Remove the last edge to have n values
# Now each edge corresponds to a bin, and is a left edge of the bin
left_edges = edges[:-1]
# All bins have the same width by default, so compute it
# as the difference between any two adjacent edges
width = edges[1] - edges[0]
# Arguments are:
# bar positions, bar heights, bar width (a single number), alignment
# Default alignment is 'center'
# We have left edge positions, not center positions, so use 'edge' here.
plt.bar(left_edges, height, width=width, align='edge')

# You can also use these values for any other kind of plot
bin_centers = left_edges + width/2
plt.plot(bin_centers, height, 'r')

Out[ ]:
[<matplotlib.lines.Line2D at 0x7f5b82d05490>]

### Pie chart

In [ ]:
import matplotlib.pyplot as plt

# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]

fig1, ax1 = plt.subplots()
# autopct is the format of the percentage printed inside each slice
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()


In [ ]:
import seaborn as sns

In [ ]:
sns.set_theme()

In [ ]:
# Adding hue argument to pairplot
sns.pairplot(df, hue='level of education')

Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7f5b7e8b8730>
In [ ]:
# jointplot as a more complex scatterplot
sns.jointplot(data=df, x="salary", y="sales", hue="level of education")

Out[ ]:
<seaborn.axisgrid.JointGrid at 0x7f5b8929d2e0>
In [ ]:
# lmplot as a more complex scatterplot
sns.lmplot(data=df, x="salary", y="sales", hue="level of education")

Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7f5b7e8b8430>

Plot how salary depends on level of education. Use catplot and boxplot from Seaborn.

Plot 3 pie charts showing the distribution of division for the following cases:

1. All employees
2. Education is 'high school'
3. Salary is less than 100000
In [ ]:
# useful methods:
print(df['level of education'].unique())
sales_index = df['sales'] < 200000
print(sales_index.sum())
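
The two methods above combine naturally with value_counts(), which turns a filtered column into sizes and labels for a pie chart. The sketch below is one possible approach, assuming a made-up data frame (`demo`) with the workshop's column names rather than the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows that only mimic the workshop's column names
demo = pd.DataFrame({
    "division": ["peripherals", "peripherals",
                 "office supplies", "computer hardware"],
    "level of education": ["high school", "some college",
                           "high school", "high school"],
    "salary": [70000, 120000, 82000, 95000],
})

# A boolean Series works as a row filter
is_high_school = demo["level of education"] == "high school"
subset = demo[is_high_school]

# One count per division, with the division names as the index
counts = subset["division"].value_counts()

fig, ax = plt.subplots()
ax.pie(counts.values, labels=list(counts.index), autopct="%1.1f%%")
ax.axis("equal")
```

The same pattern applies to any other condition: only the line that builds the boolean Series changes.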


Use the col and row parameters of Seaborn's plotting functions to draw a grid of salary/sales plots.

Each panel in the grid should show one combination of division and level of education.
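
As a rough illustration of how col and row split the data, the sketch below uses sns.relplot on a made-up data frame (`demo`, not the workshop data). The choice of relplot is an assumption; other figure-level Seaborn functions accept the same two parameters.

```python
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic the workshop's column names
demo = pd.DataFrame({
    "division": ["peripherals", "peripherals",
                 "office supplies", "office supplies"] * 2,
    "level of education": ["high school"] * 4 + ["some college"] * 4,
    "salary": [70, 75, 80, 85, 90, 95, 100, 105],
    "sales": [250, 260, 270, 280, 290, 300, 310, 320],
})

# One column of panels per division, one row per level of education
grid = sns.relplot(data=demo, x="salary", y="sales",
                   col="division", row="level of education")
```

The returned object holds a 2D array of axes, one per combination of the row and column values.
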

Plot the same graph as in task 24.1, but remove the outliers, that is, the salary values that are too far from the average.
Remove the rows whose salary is more than 2 standard deviations (.std()) away from the mean.
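
The filtering rule can be sketched as a boolean mask. The numbers below are made up so that exactly one value is an obvious outlier; the variable names are illustrative only.

```python
import pandas as pd

# Made-up salaries with one obvious outlier (300)
demo = pd.DataFrame({"salary": [70, 72, 75, 78, 80, 82, 85, 88, 90, 300]})

mean = demo["salary"].mean()
std = demo["salary"].std()

# Keep rows whose distance from the mean is at most 2 standard deviations
within = (demo["salary"] - mean).abs() <= 2 * std
filtered = demo[within]

print(len(demo), len(filtered))
```

Note that the mean and standard deviation are computed before filtering, so the outlier itself inflates both.
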
Compute the average salary for each combination of work experience and training level.
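
Averages per combination of two columns are usually computed with groupby. A minimal sketch, assuming a made-up data frame (`demo`) with the workshop's column names:

```python
import pandas as pd

# Made-up rows that only mimic the workshop's column names
demo = pd.DataFrame({
    "work experience": [5, 5, 4, 4, 5],
    "training level": [1, 1, 0, 1, 0],
    "salary": [80000, 90000, 70000, 75000, 60000],
})

# One mean per (work experience, training level) pair
mean_salary = demo.groupby(["work experience", "training level"])["salary"].mean()

# Optional: reshape into a table with training levels as columns
table = mean_salary.unstack()
print(table)
```

The unstacked table has one row per work experience value and one column per training level, which is convenient for a heatmap or a quick visual check.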