Workshop 21. NumPy. Pandas. Data analysis project.

Data Analysis Project

  1. Find and download a dataset.
  2. Check descriptive statistics of the dataset: mean, median, standard deviation for some of the fields (at least 3 fields).
  3. Plot graphs for some of the fields (at least 3 fields)
  4. Do at least 2 comparisons of values of the fields filtered by some condition (an example of this will be given later).
  5. Describe in 2-3 paragraphs, what kind of dataset it is. What data is there, what patterns you noticed in the data.

Full project instructions will be available soon on the wiki.

Data sources

https://www.kaggle.com/

https://dataverse.harvard.edu/

https://datasetsearch.research.google.com/

https://github.com/awesomedata/awesome-public-datasets

Other sources. Useful search keywords are "open dataset" or "public dataset".

For the purposes of the project, you should select datasets from areas where "data" is a table of meaningful numbers, instead of something complex like fMRI data, images, eye-tracking data and so on.

NumPy

Numpy is a 3rd party module that provides tools for working with matrices and arrays and performing complex computations.

Its documentation can be found here: https://numpy.org/doc/1.21/

The main object of NumPy is an n-dimensional array.

A 1-dimensional array is a vector. It is similar to a Python list, but it can only hold numbers of the same type.

A 2-dimensional array is a matrix.

In [ ]:
import numpy as np
In [ ]:
# np.array converts a list of lists of numbers into a numpy array
a = np.array([[1.5, 0, 1], [0, 2, 1]])
z = np.zeros((3,4))
print(a)
print(z)
[[1.5 0.  1. ]
 [0.  2.  1. ]]
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
In [ ]:
print(a.ndim) # number of dimensions
print(a.shape) # number of elements along each dimension
print(a.size) # number of elements overall
print(a.T) # transposing a matrix
2
(2, 3)
6
[[1.5 0. ]
 [0.  2. ]
 [1.  1. ]]
In [ ]:
a_single = np.array([1, 2, 9])
a_double = np.array([[1, 2, 9]])
print(a_single)
print(a_double)
print(a_single.T)
print(a_double.T)
print(a_single.shape)
print(a_double.shape)

# making vectors into 1-by-n 2-dimensional arrays simplifies working with them in the context of matrices.

print(a_double.T * a_double)
print(a_single * a_single.T)
[1 2 9]
[[1 2 9]]
[1 2 9]
[[1]
 [2]
 [9]]
(3,)
(1, 3)
[[ 1  2  9]
 [ 2  4 18]
 [ 9 18 81]]
[ 1  4 81]
In [ ]:
ab = np.random.rand(3, 4)*10
print(ab)
print()
print()

# Indexing is usually done with tuples
ab_slice = ab[1:,:3]
print(ab_slice)

indices = ab[0, :] > 4
print(indices)
print(ab[:, indices])
[[7.46173146 6.62217718 2.60960204 4.13956425]
 [0.73936542 1.88682724 7.96806865 9.35356483]
 [4.44482064 2.56019377 6.12784082 8.53571277]]


[[0.73936542 1.88682724 7.96806865]
 [4.44482064 2.56019377 6.12784082]]
[ True  True False  True]
[[7.46173146 6.62217718 4.13956425]
 [0.73936542 1.88682724 9.35356483]
 [4.44482064 2.56019377 8.53571277]]

NumPy ndarrays support basic mathematical operations and perform them much faster than Python lists.

Before Pandas. Loading data from files

csv

There is a module csv that makes it easier to read and write csv files. However, to work with data itself you have to use standard Python functions.

In [ ]:
### Writing a csv file

import csv

data_list = [
    [1, 2, 3],
    [4, 5, 6]
]
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_list)
In [ ]:
### Reading a csv file

import csv
data = []
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row)
        print(row)
['1', '2', '3']
['4', '5', '6']

json

Another common data type is json. It stores data similarly to the Python syntax. You can load it with module json.

In [ ]:
### Writing json

import json

data = ['foo', {'bar': ('baz', None, 1.0, 2)}]
with open('some.json', 'w') as f:
    json.dump(data, f)
    
In [ ]:
### Reading json

import json

with open('some.json') as f:
    data = json.load(f)
print(data)  
print(data[1])
['foo', {'bar': ['baz', None, 1.0, 2]}]
{'bar': ['baz', None, 1.0, 2]}

Task 1

In the file sales.csv (link) there is a fictional set of sales data for employees of some company.

Load this dataset using the module csv. Convert the numerical parts of data into numpy arrays: training level, work experience, salary, sales.

Compute a numpy array for efficiency that is defined as sales divided by salary. Use only array operations, no loops.

Use the function np.argmax(a) to find which employees have the highest salary, sales value and efficiency.

Task 2

Use indexing with logical expressions a[a < 5] to find a list of salaries of employees whose work experience is greater than 4 years.

Do not change the order of any of your arrays and use that fact. For each employee, the index in each array is always the same

Pandas

pandas is a module that allows you to perform many tasks related to data analysis.

The main format of storing data in pandas is DataFrame.

DataFrame

DataFrame is a two-dimensional indexed array of values with a header. Generally, rows are objects and columns are individual properties, with their names in the header.

In [ ]:
import pandas as pd
In [ ]:
df_ex_no_labels = pd.DataFrame(ab[1,:])
print(df_ex_no_labels)
          0
0  0.739365
1  1.886827
2  7.968069
3  9.353565
In [ ]:
df_ex = pd.DataFrame(ab[:,:], columns=['Value 1', 'Important number', 'Value 2', 'Another value'])
print(df_ex)
    Value 1  Important number   Value 2  Another value
0  7.461731          6.622177  2.609602       4.139564
1  0.739365          1.886827  7.968069       9.353565
2  4.444821          2.560194  6.127841       8.535713
In [ ]:
# Usually a data frame is read from a file
# comma separated values
data = '''
Example 1,Example 2,Example3
1,2,3
2,51,35
3,0,0
4,50,25

'''

with open('df.csv', 'w') as df_file:
    df_file.write(data)

df = pd.read_csv("df.csv")
print(df)
   Example 1  Example 2  Example3
0          1          2         3
1          2         51        35
2          3          0         0
3          4         50        25
In [ ]:
df2 = pd.read_csv('https://gist.githubusercontent.com/l8doku/eded81e7f66e504f96117ef90ed28dcf/raw/e286e724378ea8ad47d7ee6ccd245084d7a797cd/example_data.csv')
print(df2)
   Provider Station_ID Composed_Indicator  Overall_Availability  \
0       ANA   51240000               0.35                 19.02   
1       ANA   51280000                  -                  0.08   
2       BOM    102101A               0.92                 80.09   
3       BOM    104001A               0.79                 76.18   
4       BOM     304499               0.94                 84.39   
5       BOM     402206               0.86                 66.85   
6       BOM   A0030501               0.87                 70.94   
7      CHDP          1               0.56                 32.05   
8      CHDP        100               0.65                 41.54   
9      CHDP        101               0.67                 43.48   
10     CHDP        102                0.6                 39.41   
11     CHDP        103               0.67                 44.07   
12     CHDP         57                  -                  7.76   
13     CHDP         58               0.67                 38.86   
14     CHDP          6               0.72                 44.07   

    Longest_Availability_without_Gaps  Continuity_in_Available_Data  \
0                                2.17                         11.42   
1                                0.07                         88.89   
2                               80.09                        100.00   
3                               76.18                        100.00   
4                               84.39                        100.00   
5                               66.85                        100.00   
6                               70.94                        100.00   
7                                6.49                         20.26   
8                               25.28                         60.86   
9                               35.02                         80.54   
10                              18.28                         46.38   
11                              42.37                         96.16   
12                               7.76                        100.00   
13                               7.05                         18.15   
14                              42.37                         96.16   

    Minimum_Relative_Availability_for_a_Month  Ratio_of_Outliers  \
0                                       94.45               0.93   
1                                        0.00               0.00   
2                                       99.47               0.74   
3                                       98.75               0.67   
4                                       98.41               0.55   
5                                       98.88               0.76   
6                                       97.96               0.69   
7                                       96.90               0.19   
8                                       97.83               0.26   
9                                       97.08               0.83   
10                                      94.61               0.62   
11                                      99.98               0.75   
12                                      87.33               0.12   
13                                      97.77               1.03   
14                                      99.98               0.25   

   Homogeneity_of_Yearly_Averages Trend_in_Annual_Flows Trend_in_One_Month  
0                               1           Significant        Significant  
1                               -                     -                  -  
2                               0       Not_significant    Not_significant  
3                               0       Not_significant        Significant  
4                               0       Not_significant    Not_significant  
5                               1       Not_significant    Not_significant  
6                               1       Not_significant    Not_significant  
7                               0       Not_significant        Significant  
8                               0       Not_significant        Significant  
9                               0       Not_significant        Significant  
10                              0       Not_significant        Significant  
11                              1       Not_significant        Significant  
12                              -                     -                  -  
13                              0       Not_significant    Not_significant  
14                              0       Not_significant        Significant  
In [ ]:
df2.columns
Out[ ]:
Index(['Provider', 'Station_ID', 'Composed_Indicator', 'Overall_Availability',
       'Longest_Availability_without_Gaps', 'Continuity_in_Available_Data',
       'Minimum_Relative_Availability_for_a_Month', 'Ratio_of_Outliers',
       'Homogeneity_of_Yearly_Averages', 'Trend_in_Annual_Flows',
       'Trend_in_One_Month'],
      dtype='object')
In [ ]:
df2[['Provider', 'Trend_in_One_Month']]
Out[ ]:
Provider Trend_in_One_Month
0 ANA Significant
1 ANA -
2 BOM Not_significant
3 BOM Significant
4 BOM Not_significant
5 BOM Not_significant
6 BOM Not_significant
7 CHDP Significant
8 CHDP Significant
9 CHDP Significant
10 CHDP Significant
11 CHDP Significant
12 CHDP -
13 CHDP Not_significant
14 CHDP Significant
In [ ]: