Workshop 22. Pandas. Visualization¶

Pandas¶

pandas is a module that allows you to perform many tasks related to data analysis.

The main format of storing data in pandas is DataFrame.

PyCharm:

File (or PyCharm on MacOS) -> Settings -> Project: [name of the project] -> Python Interpreter -> + -> type "pandas" -> Install Package

# jupyter:
!pip install pandas

DataFrame¶

DataFrame is a two-dimensional indexed array of values with a header. Generally, rows are objects and columns are individual properties, with their names in the header.

import pandas as pd
import numpy as np

ab = np.random.rand(3, 4)
print(ab)

[[0.97882129 0.97638641 0.38531908 0.58640792]
 [0.7876458  0.09795813 0.04826831 0.19525838]
 [0.28949706 0.34759189 0.15193296 0.9422557 ]]

df_ex_no_labels = pd.DataFrame(ab[1,:])
print(df_ex_no_labels)

          0
0  0.064851
1  0.745081
2  0.454588
3  0.129970

You can add column names to the dataframe. Dataframes always have indices (row names) and column names.

df_ex = pd.DataFrame(ab[:,:], columns=['Value 1', 'Important number', 'Value 2', 'Another value'])
print(df_ex)

    Value 1  Important number   Value 2  Another value
0  6.515913          1.360627  1.835218       7.684258
1  6.249636          9.224460  2.980657       1.659288
2  1.289803          0.206213  0.392303       2.870422

Creating dataframes from files¶

Pandas can create a dataframe directly from a file. It is smart enough to recognize a header of a csv file and treat it separately.

# Usually a data frame is read from a file
# comma separated values
data = '''
Example 1,Example 2,Example3
1,2,3
2,51,35
3,0,0
4,50,25

'''

with open('df.csv', 'w') as df_file:
    df_file.write(data)

df = pd.read_csv("df.csv")
print(df)

   Example 1  Example 2  Example3
0          1          2         3
1          2         51        35
2          3          0         0
3          4         50        25

df2 = pd.read_csv('https://gist.githubusercontent.com/l8doku/d3d8a8dfb55482f3371a517dc8b38d1a/raw/f78cb5cb49f825173b1ca19478fa0ff2d1efad2e/sales.csv')
print(df2)

              division  level of education  training level  work experience  \
0    computer hardware        some college               1                5   
1          peripherals   bachelor's degree               1                4   
2          peripherals         high school               0                5   
3      office supplies  associate's degree               1                5   
4          peripherals        some college               1                5   
..                 ...                 ...             ...              ...   
195    office supplies  associate's degree               0                5   
196    office supplies        some college               0                7   
197           printers        some college               3               11   
198        peripherals        some college               0                7   
199  computer hardware   bachelor's degree               1                7   

     salary   sales  
0     81769  302611  
1     89792  274336  
2     70797  256854  
3     82236  279598  
4     73725  261014  
..      ...     ...  
195   79457  234788  
196   84363  314430  
197  128755  630671  
198   86113  306482  
199  102531  340807  

[200 rows x 6 columns]

df2.columns

Index(['division', 'level of education', 'training level', 'work experience',
       'salary', 'sales'],
      dtype='object')

To index columns you have to use two opening and closing brackets. You index them by passing a list of all column names you want to consider.

df2.dtypes

division              object
level of education    object
training level         int64
work experience        int64
salary                 int64
sales                  int64
dtype: object

df2[['division', 'training level']]

# quick summary of useful statistics
df2.describe()

Analysis example¶

Here the example from last workshop is repeated: compare sales and salaries of trained vs untrained employees.

not_trained_indices = df2['training level'] == 0
trained_indices = df2['training level'] > 0

print(not_trained_indices)
# converting True to 1 and False to 0
print(not_trained_indices.sum())
print(trained_indices.sum())

0      False
1      False
2       True
3      False
4      False
       ...  
195     True
196     True
197    False
198     True
199    False
Name: training level, Length: 200, dtype: bool
45
155

trained = df2[trained_indices]
not_trained = df2[not_trained_indices]
print(trained)

              division  level of education  training level  work experience  \
0    computer hardware        some college               1                5   
1          peripherals   bachelor's degree               1                4   
3      office supplies  associate's degree               1                5   
4          peripherals        some college               1                5   
5      office supplies        some college               1                7   
..                 ...                 ...             ...              ...   
192        peripherals         high school               2               11   
193           printers  associate's degree               1                1   
194  computer software  associate's degree               2                3   
197           printers        some college               3               11   
199  computer hardware   bachelor's degree               1                7   

     salary   sales  
0     81769  302611  
1     89792  274336  
3     82236  279598  
4     73725  261014  
5     83982  354550  
..      ...     ...  
192  122982  587746  
193   70223  184336  
194   83493  244925  
197  128755  630671  
199  102531  340807  

[155 rows x 6 columns]

print(trained['salary'].mean())
print(not_trained['salary'].mean())

92660.47096774193
82522.06666666667

trained_hardware = trained[trained['division'] == 'computer hardware']
print(trained_hardware['salary'].mean())
print(trained_hardware['salary'].std())
print(trained['salary'].mean())
print(trained_hardware)

98138.54545454546
12865.58998601742
92660.47096774193
              division  level of education  training level  work experience  \
0    computer hardware        some college               1                5   
9    computer hardware   bachelor's degree               3                3   
18   computer hardware        some college               1                7   
28   computer hardware  associate's degree               3                8   
45   computer hardware  associate's degree               2                7   
61   computer hardware  associate's degree               1                5   
72   computer hardware   bachelor's degree               1                5   
75   computer hardware   bachelor's degree               1                8   
81   computer hardware  associate's degree               1               12   
102  computer hardware  associate's degree               1                7   
107  computer hardware  associate's degree               2                3   
114  computer hardware  associate's degree               2                3   
129  computer hardware        some college               2                9   
150  computer hardware   bachelor's degree               1               11   
152  computer hardware   bachelor's degree               1                7   
154  computer hardware         high school               1                7   
156  computer hardware   bachelor's degree               2                6   
161  computer hardware        some college               1               12   
178  computer hardware        some college               1                6   
185  computer hardware        some college               2                3   
186  computer hardware  associate's degree               1                8   
199  computer hardware   bachelor's degree               1                7   

     salary   sales  
0     81769  302611  
9     88989  343869  
18    93598  348599  
28   110483  457251  
45    96110  391102  
61    90091  290524  
72    94003  330317  
75   111621  386927  
81   121785  547853  
102  104019  345814  
107   87614  264659  
114   84095  279470  
129  112185  507503  
150  119221  441109  
152   99134  381344  
154   82721  322950  
156   96782  373730  
161  117378  470353  
178   88875  274675  
185   77544  302899  
186   98500  328798  
199  102531  340807

Visualization¶

Matplotlib¶

PyCharm:

File (or PyCharm on MacOS) -> Settings -> Project: [name of the project] -> Python Interpreter -> + -> type "matplotlib" -> Install Package

# Jupyter:
!pip install matplotlib

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: matplotlib in /usr/lib/python3.9/site-packages (3.4.3)
Requirement already satisfied: cycler>=0.10 in /usr/lib/python3.9/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3.9/site-packages (from matplotlib) (1.3.2)
Requirement already satisfied: numpy>=1.16 in /usr/lib/python3.9/site-packages (from matplotlib) (1.21.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/lib/python3.9/site-packages (from matplotlib) (8.4.0)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/lib/python3.9/site-packages (from matplotlib) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in /usr/lib/python3.9/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six in /usr/lib/python3.9/site-packages (from cycler>=0.10->matplotlib) (1.16.0)

import matplotlib.pyplot as plt
%matplotlib inline

?plt.plot

import numpy as np
def f(x):
    return x**2

xx = np.linspace(-1, 2, 50)
# X, Y
plt.plot(xx, f(xx))

[<matplotlib.lines.Line2D at 0x7f15e6a8f220>]

# data, format
plt.plot(trained['salary'], 'H')

[<matplotlib.lines.Line2D at 0x7f15e6b1bd90>]

# scaterplot
plt.plot(trained['salary'], trained['sales'], '.')

[<matplotlib.lines.Line2D at 0x7f15e69d1b20>]

?plt.hist

plt.hist(trained['salary'], bins=20, density=True)
plt.hist(not_trained['salary'], bins=20, color='#11111166', density=True)

(array([2., 1., 2., 3., 1., 2., 4., 4., 2., 2., 5., 2., 1., 5., 3., 2., 0.,
        1., 0., 3.]),
 array([ 56514. ,  59225.8,  61937.6,  64649.4,  67361.2,  70073. ,
         72784.8,  75496.6,  78208.4,  80920.2,  83632. ,  86343.8,
         89055.6,  91767.4,  94479.2,  97191. ,  99902.8, 102614.6,
        105326.4, 108038.2, 110750. ]),
 <BarContainer object of 20 artists>)

Tasks¶

Task 22.1¶

Load sales.csv using the link to github.

Repeat the tasks 1 and 2 from workshop 21 using Pandas. You can also just divide columns by each other using a mathematical expression.

Task 22.2¶

Plot efficiency as a function of work experience.

Plot it for two different subsets of data: division is "computer hardware" or division is everything else.

Task 22.3¶

Plot scatterplots "salary vs sales" for employees with low work experience (0-1) and high work experience (2+) on the same graph.

# scatterplot
plt.plot(trained['salary'], trained['sales'], '.')
plt.plot(not_trained['salary'], not_trained['sales'], '.r')

Task 22.4¶

Plot efficiencies from task 22.2 as histograms. Again, separate them by division. Use the code from the histogram example to make them normalized (density=True) and make the top one transparent (color='#11111166)

	training level	work experience	salary	sales
count	200.000000	200.000000	200.000000	200.000000
mean	1.250000	6.015000	90379.330000	338438.915000
std	0.922908	2.850509	16616.093562	109041.854566
min	0.000000	0.000000	56514.000000	73739.000000
25%	1.000000	4.000000	77932.250000	260936.250000
50%	1.000000	6.000000	89941.500000	330494.000000
75%	2.000000	8.000000	100265.750000	398611.750000
max	3.000000	15.000000	139374.000000	681512.000000

	division	training level
0	computer hardware	1
1	peripherals	1
2	peripherals	0
3	office supplies	1
4	peripherals	1
...	...	...
195	office supplies	0
196	office supplies	0
197	printers	3
198	peripherals	0
199	computer hardware	1