Workshop 22. Pandas. Visualization

Pandas

pandas is a module that allows you to perform many tasks related to data analysis.

The main format of storing data in pandas is DataFrame.

PyCharm:

File (or PyCharm on MacOS) -> Settings -> Project: [name of the project] -> Python Interpreter -> + -> type "pandas" -> Install Package

In [ ]:
# jupyter:
!pip install pandas

DataFrame

DataFrame is a two-dimensional indexed array of values with a header. Generally, rows are objects and columns are individual properties, with their names in the header.

In [1]:
import pandas as pd
import numpy as np
In [ ]:
ab = np.random.rand(3, 4)
print(ab)
[[0.97882129 0.97638641 0.38531908 0.58640792]
 [0.7876458  0.09795813 0.04826831 0.19525838]
 [0.28949706 0.34759189 0.15193296 0.9422557 ]]
In [ ]:
df_ex_no_labels = pd.DataFrame(ab[1,:])
print(df_ex_no_labels)
          0
0  0.064851
1  0.745081
2  0.454588
3  0.129970

You can add column names to the dataframe. Dataframes always have indices (row names) and column names.

In [ ]:
df_ex = pd.DataFrame(ab[:,:], columns=['Value 1', 'Important number', 'Value 2', 'Another value'])
print(df_ex)
    Value 1  Important number   Value 2  Another value
0  6.515913          1.360627  1.835218       7.684258
1  6.249636          9.224460  2.980657       1.659288
2  1.289803          0.206213  0.392303       2.870422

Creating dataframes from files

Pandas can create a dataframe directly from a file. It is smart enough to recognize a header of a csv file and treat it separately.

In [ ]:
# Usually a data frame is read from a file
# comma separated values
data = '''
Example 1,Example 2,Example3
1,2,3
2,51,35
3,0,0
4,50,25

'''

with open('df.csv', 'w') as df_file:
    df_file.write(data)

df = pd.read_csv("df.csv")
print(df)
   Example 1  Example 2  Example3
0          1          2         3
1          2         51        35
2          3          0         0
3          4         50        25
In [ ]:
df2 = pd.read_csv('https://gist.githubusercontent.com/l8doku/d3d8a8dfb55482f3371a517dc8b38d1a/raw/f78cb5cb49f825173b1ca19478fa0ff2d1efad2e/sales.csv')
print(df2)
              division  level of education  training level  work experience  \
0    computer hardware        some college               1                5   
1          peripherals   bachelor's degree               1                4   
2          peripherals         high school               0                5   
3      office supplies  associate's degree               1                5   
4          peripherals        some college               1                5   
..                 ...                 ...             ...              ...   
195    office supplies  associate's degree               0                5   
196    office supplies        some college               0                7   
197           printers        some college               3               11   
198        peripherals        some college               0                7   
199  computer hardware   bachelor's degree               1                7   

     salary   sales  
0     81769  302611  
1     89792  274336  
2     70797  256854  
3     82236  279598  
4     73725  261014  
..      ...     ...  
195   79457  234788  
196   84363  314430  
197  128755  630671  
198   86113  306482  
199  102531  340807  

[200 rows x 6 columns]
In [ ]:
df2.columns
Out[ ]:
Index(['division', 'level of education', 'training level', 'work experience',
       'salary', 'sales'],
      dtype='object')

To index columns you have to use two opening and closing brackets. You index them by passing a list of all column names you want to consider.

In [ ]:
df2.dtypes
Out[ ]:
division              object
level of education    object
training level         int64
work experience        int64
salary                 int64
sales                  int64
dtype: object
In [ ]:
df2[['division', 'training level']]
Out[ ]:
division training level
0 computer hardware 1
1 peripherals 1
2 peripherals 0
3 office supplies 1
4 peripherals 1
... ... ...
195 office supplies 0
196 office supplies 0
197 printers 3
198 peripherals 0
199 computer hardware 1

200 rows × 2 columns

In [ ]:
# quick summary of useful statistics
df2.describe()
Out[ ]:
training level work experience salary sales
count 200.000000 200.000000 200.000000 200.000000
mean 1.250000 6.015000 90379.330000 338438.915000
std 0.922908 2.850509 16616.093562 109041.854566
min 0.000000 0.000000 56514.000000 73739.000000
25% 1.000000 4.000000 77932.250000 260936.250000
50% 1.000000 6.000000 89941.500000 330494.000000
75% 2.000000 8.000000 100265.750000 398611.750000
max 3.000000 15.000000 139374.000000 681512.000000

Analysis example

Here the example from last workshop is repeated: compare sales and salaries of trained vs untrained employees.

In [ ]:
not_trained_indices = df2['training level'] == 0
trained_indices = df2['training level'] > 0

print(not_trained_indices)
# converting True to 1 and False to 0
print(not_trained_indices.sum())
print(trained_indices.sum())
0      False
1      False
2       True
3      False
4      False
       ...  
195     True
196     True
197    False
198     True
199    False
Name: training level, Length: 200, dtype: bool
45
155
In [ ]:
trained = df2[trained_indices]
not_trained = df2[not_trained_indices]
print(trained)
              division  level of education  training level  work experience  \
0    computer hardware        some college               1                5   
1          peripherals   bachelor's degree               1                4   
3      office supplies  associate's degree               1                5   
4          peripherals        some college               1                5   
5      office supplies        some college               1                7   
..                 ...                 ...             ...              ...   
192        peripherals         high school               2               11   
193           printers  associate's degree               1                1   
194  computer software  associate's degree               2                3   
197           printers        some college               3               11   
199  computer hardware   bachelor's degree               1                7   

     salary   sales  
0     81769  302611  
1     89792  274336  
3     82236  279598  
4     73725  261014  
5     83982  354550  
..      ...     ...  
192  122982  587746  
193   70223  184336  
194   83493  244925  
197  128755  630671  
199  102531  340807  

[155 rows x 6 columns]
In [ ]:
print(trained['salary'].mean())
print(not_trained['salary'].mean())
92660.47096774193
82522.06666666667
In [ ]:
trained_hardware = trained[trained['division'] == 'computer hardware']
print(trained_hardware['salary'].mean())
print(trained_hardware['salary'].std())
print(trained['salary'].mean())
print(trained_hardware)
98138.54545454546
12865.58998601742
92660.47096774193
              division  level of education  training level  work experience  \
0    computer hardware        some college               1                5   
9    computer hardware   bachelor's degree               3                3   
18   computer hardware        some college               1                7   
28   computer hardware  associate's degree               3                8   
45   computer hardware  associate's degree               2                7   
61   computer hardware  associate's degree               1                5   
72   computer hardware   bachelor's degree               1                5   
75   computer hardware   bachelor's degree               1                8   
81   computer hardware  associate's degree               1               12   
102  computer hardware  associate's degree               1                7   
107  computer hardware  associate's degree               2                3   
114  computer hardware  associate's degree               2                3   
129  computer hardware        some college               2                9   
150  computer hardware   bachelor's degree               1               11   
152  computer hardware   bachelor's degree               1                7   
154  computer hardware         high school               1                7   
156  computer hardware   bachelor's degree               2                6   
161  computer hardware        some college               1               12   
178  computer hardware        some college               1                6   
185  computer hardware        some college               2                3   
186  computer hardware  associate's degree               1                8   
199  computer hardware   bachelor's degree               1                7   

     salary   sales  
0     81769  302611  
9     88989  343869  
18    93598  348599  
28   110483  457251  
45    96110  391102  
61    90091  290524  
72    94003  330317  
75   111621  386927  
81   121785  547853  
102  104019  345814  
107   87614  264659  
114   84095  279470  
129  112185  507503  
150  119221  441109  
152   99134  381344  
154   82721  322950  
156   96782  373730  
161  117378  470353  
178   88875  274675  
185   77544  302899  
186   98500  328798  
199  102531  340807  

Visualization

Matplotlib

PyCharm:

File (or PyCharm on MacOS) -> Settings -> Project: [name of the project] -> Python Interpreter -> + -> type "matplotlib" -> Install Package

In [ ]:
# Jupyter:
!pip install matplotlib
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: matplotlib in /usr/lib/python3.9/site-packages (3.4.3)
Requirement already satisfied: cycler>=0.10 in /usr/lib/python3.9/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3.9/site-packages (from matplotlib) (1.3.2)
Requirement already satisfied: numpy>=1.16 in /usr/lib/python3.9/site-packages (from matplotlib) (1.21.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/lib/python3.9/site-packages (from matplotlib) (8.4.0)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/lib/python3.9/site-packages (from matplotlib) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in /usr/lib/python3.9/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six in /usr/lib/python3.9/site-packages (from cycler>=0.10->matplotlib) (1.16.0)
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
?plt.plot
In [ ]:
import numpy as np
def f(x):
    return x**2

xx = np.linspace(-1, 2, 50)
# X, Y
plt.plot(xx, f(xx))
Out[ ]:
[<matplotlib.lines.Line2D at 0x7f15e6a8f220>]
In [ ]:
# data, format
plt.plot(trained['salary'], 'H')
Out[ ]:
[<matplotlib.lines.Line2D at 0x7f15e6b1bd90>]
In [ ]:
# scaterplot
plt.plot(trained['salary'], trained['sales'], '.')
Out[ ]:
[<matplotlib.lines.Line2D at 0x7f15e69d1b20>]
In [ ]:
?plt.hist
In [ ]:
plt.hist(trained['salary'], bins=20, density=True)
plt.hist(not_trained['salary'], bins=20, color='#11111166', density=True)
Out[ ]:
(array([2., 1., 2., 3., 1., 2., 4., 4., 2., 2., 5., 2., 1., 5., 3., 2., 0.,
        1., 0., 3.]),
 array([ 56514. ,  59225.8,  61937.6,  64649.4,  67361.2,  70073. ,
         72784.8,  75496.6,  78208.4,  80920.2,  83632. ,  86343.8,
         89055.6,  91767.4,  94479.2,  97191. ,  99902.8, 102614.6,
        105326.4, 108038.2, 110750. ]),
 <BarContainer object of 20 artists>)

Tasks

Task 22.1

Load sales.csv using the link to github.

Repeat the tasks 1 and 2 from workshop 21 using Pandas. You can also just divide columns by each other using a mathematical expression.

Task 22.2

Plot efficiency as a function of work experience.

Plot it for two different subsets of data: division is "computer hardware" or division is everything else.

Task 22.3

Plot scatterplots "salary vs sales" for employees with low work experience (0-1) and high work experience (2+) on the same graph.

In [ ]:
# scatterplot
plt.plot(trained['salary'], trained['sales'], '.')
plt.plot(not_trained['salary'], not_trained['sales'], '.r') 

Task 22.4

Plot efficiencies from task 22.2 as histograms. Again, separate them by division. Use the code from the histogram example to make them normalized (density=True) and make the top one transparent (color='#11111166)