pandas is a module for a wide range of data analysis tasks.
The main data structure in pandas is the DataFrame.
PyCharm:
File (or PyCharm on macOS) -> Settings -> Project: [name of the project] -> Python Interpreter -> +
-> type "pandas" -> Install Package
# Jupyter:
!pip install pandas
A DataFrame is a two-dimensional, indexed array of values with a header. Typically, each row is one object (an observation) and each column is one property of those objects, with the property names in the header.
import pandas as pd
import numpy as np
ab = np.random.rand(3, 4)
print(ab)
# pass the whole 2D array; ab[1,:] would give only one row, stored as a single column
df_ex_no_labels = pd.DataFrame(ab)
print(df_ex_no_labels)
You can add column names to a DataFrame. A DataFrame always has an index (row labels) and column names; if you do not supply them, pandas numbers them 0, 1, 2, and so on.
df_ex = pd.DataFrame(ab, columns=['Value 1', 'Important number', 'Value 2', 'Another value'])
print(df_ex)
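The remark about indices and column names can be checked on a small array. This is a minimal sketch; the array contents and the labels 'a', 'b', 'c', 'height' and 'width' are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)  # made-up 3x2 array: [[0, 1], [2, 3], [4, 5]]

# Without labels, pandas numbers rows and columns starting from 0.
df_default = pd.DataFrame(values)
print(df_default.index.tolist())    # [0, 1, 2]
print(df_default.columns.tolist())  # [0, 1]

# Row labels (index) and column names can both be given explicitly.
df_labeled = pd.DataFrame(values, index=['a', 'b', 'c'], columns=['height', 'width'])
print(df_labeled)
```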
pandas can create a DataFrame directly from a file. By default, read_csv treats the first line of a CSV file as the header and uses it for the column names.
# Usually a data frame is read from a file
# comma separated values
data = '''
Example 1,Example 2,Example 3
1,2,3
2,51,35
3,0,0
4,50,25
'''
with open('df.csv', 'w') as df_file:
    df_file.write(data)
df = pd.read_csv("df.csv")
print(df)
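To see what the header handling does, the same text can be read twice, once with the default settings and once with header=None. This is a small sketch on made-up data; it reads from an in-memory string through io.StringIO instead of a file, which is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

csv_text = "name,score\nAnn,3\nBob,5\n"  # made-up data

# Default: the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(csv_text))
print(with_header.columns.tolist())  # ['name', 'score']
print(with_header.shape)             # (2, 2)

# header=None: the first line is treated as data and columns are numbered.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.columns.tolist())    # [0, 1]
print(no_header.shape)               # (3, 2)
```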
df2 = pd.read_csv('https://gist.githubusercontent.com/l8doku/d3d8a8dfb55482f3371a517dc8b38d1a/raw/f78cb5cb49f825173b1ca19478fa0ff2d1efad2e/sales.csv')
print(df2)
df2.columns
To select several columns, use double square brackets: the outer pair is the indexing operation and the inner pair is a list of the column names you want.
df2.dtypes
df2[['division', 'training level']]
# quick summary of useful statistics
df2.describe()
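The difference between single and double brackets is easiest to see from the types of the results. The sketch below uses a made-up table; the names people, name, age and city are hypothetical and not part of sales.csv.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'age': [31, 45],
    'city': ['Oslo', 'Rome'],
})

one_column = people['age']           # a single name gives a Series
sub_frame = people[['name', 'age']]  # a list of names gives a DataFrame
single_as_frame = people[['age']]    # a one-element list still gives a DataFrame

print(type(one_column).__name__)   # Series
print(type(sub_frame).__name__)    # DataFrame
print(single_as_frame.shape)       # (2, 1)
```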
Here the example from the last workshop is repeated: comparing the sales and salaries of trained and untrained employees.
not_trained_indices = df2['training level'] == 0
trained_indices = df2['training level'] > 0
print(not_trained_indices)
# summing a boolean Series counts the True values (True counts as 1, False as 0)
print(not_trained_indices.sum())
print(trained_indices.sum())
trained = df2[trained_indices]
not_trained = df2[not_trained_indices]
print(trained)
print(trained['salary'].mean())
print(not_trained['salary'].mean())
trained_hardware = trained[trained['division'] == 'computer hardware']
print(trained_hardware['salary'].mean())
print(trained_hardware['salary'].std())
print(trained['salary'].mean())
print(trained_hardware)
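The two-step filtering above (first by training level, then by division) can also be written with a single combined mask. This is a sketch on a made-up table, so the column values are assumptions; the operators &, | and ~ are the element-wise "and", "or" and "not" for boolean Series.

```python
import pandas as pd

staff = pd.DataFrame({
    'division': ['hardware', 'hardware', 'software', 'software'],
    'training level': [0, 2, 1, 0],
    'salary': [50, 70, 60, 40],
})

is_trained = staff['training level'] > 0
is_hardware = staff['division'] == 'hardware'

# Rows where both conditions hold.
trained_hardware_toy = staff[is_trained & is_hardware]
# Rows that are untrained or in hardware (or both).
untrained_or_hardware = staff[~is_trained | is_hardware]

print(trained_hardware_toy['salary'].mean())  # 70.0
print(len(untrained_or_hardware))             # 3
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, for example staff[(staff['salary'] > 45) & (staff['training level'] > 0)].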
PyCharm:
File (or PyCharm on macOS) -> Settings -> Project: [name of the project] -> Python Interpreter -> +
-> type "matplotlib" -> Install Package
# Jupyter:
!pip install matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
?plt.plot
import numpy as np
def f(x):
    return x**2
xx = np.linspace(-1, 2, 50)
# X, Y
plt.plot(xx, f(xx))
# data, format ('H' draws hexagon markers with no connecting line)
plt.plot(trained['salary'], 'H')
# scatterplot
plt.plot(trained['salary'], trained['sales'], '.')
?plt.hist
plt.hist(trained['salary'], bins=20, density=True)
plt.hist(not_trained['salary'], bins=20, color='#11111166', density=True)
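Two overlaid histograms are easier to compare when they share bin edges and carry a legend. The sketch below illustrates this on randomly generated numbers rather than on the salary data; the group names, the bin range and the use of alpha for transparency (instead of an 8-digit hex color) are choices made for this example only.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(50, 10, 200)  # made-up samples
group_b = rng.normal(60, 10, 200)

bins = np.linspace(10, 100, 21)    # 21 edges, so 20 bins shared by both histograms

fig, ax = plt.subplots()
# density=True rescales the bars so that the total bar area is 1.
counts_a, edges, _ = ax.hist(group_a, bins=bins, density=True, label='group A')
# alpha makes the second histogram semi-transparent so the first stays visible.
counts_b, _, _ = ax.hist(group_b, bins=bins, density=True, alpha=0.4, label='group B')
ax.set_xlabel('value')
ax.set_ylabel('density')
ax.legend()
# In a plain script (outside Jupyter), call plt.show() to open the figure.
```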
Load sales.csv using the GitHub link.
Repeat tasks 1 and 2 from workshop 21 using pandas. Columns support arithmetic directly, so you can divide one column by another in a single expression.
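Dividing columns works element by element, row by row. Here is a minimal sketch on made-up numbers; the ratio computed below and the column name 'sales per salary' are only illustrations and may differ from the quantity defined in workshop 21.

```python
import pandas as pd

toy = pd.DataFrame({
    'sales': [100.0, 240.0, 90.0],
    'salary': [50.0, 80.0, 45.0],
})

# Each row's sales value is divided by the same row's salary value.
toy['sales per salary'] = toy['sales'] / toy['salary']
print(toy['sales per salary'].tolist())  # [2.0, 3.0, 2.0]
```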
Plot efficiency as a function of work experience.
Plot it separately for two subsets of the data: employees in the "computer hardware" division and employees in all other divisions.
Plot scatterplots "salary vs sales" for employees with low work experience (0-1) and high work experience (2+) on the same graph.
# scatterplot
plt.plot(trained['salary'], trained['sales'], '.')
plt.plot(not_trained['salary'], not_trained['sales'], '.r')
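When two groups share one graph, axis labels and a legend make it clear which points belong to which group. The following sketch shows one possible way to add them, using a made-up table; the column name 'experience' and all values are assumptions and do not come from sales.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    'experience': [0, 1, 3, 5],
    'salary': [40, 45, 70, 90],
    'sales': [100, 120, 300, 310],
})

low = toy[toy['experience'] <= 1]
high = toy[toy['experience'] >= 2]

fig, ax = plt.subplots()
# '.' draws points only; '.r' draws red points.
ax.plot(low['salary'], low['sales'], '.', label='experience 0-1')
ax.plot(high['salary'], high['sales'], '.r', label='experience 2+')
ax.set_xlabel('salary')
ax.set_ylabel('sales')
ax.legend()
```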
Plot the efficiencies from task 22.2 as histograms, again separated by division. Use the code from the histogram example to normalize them (density=True) and make the top one transparent (color='#11111166').