Full project instructions will be available soon on the wiki.
https://dataverse.harvard.edu/
https://datasetsearch.research.google.com/
https://github.com/awesomedata/awesome-public-datasets
You can also use other sources. Useful search keywords are "open dataset" or "public dataset".
For the purposes of the project, you should select datasets from areas where "data" is a table of meaningful numbers, instead of something complex like fMRI data, images, eye-tracking data and so on.
NumPy is a third-party module that provides tools for working with matrices and arrays and for performing complex computations.
Its documentation can be found here: https://numpy.org/doc/1.21/
The main object of NumPy is an n-dimensional array.
A 1-dimensional array is a vector. It is similar to a Python list, but all of its elements must have the same type (usually a numeric one).
A 2-dimensional array is a matrix.
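As a small illustration of the same-type rule, the sketch below builds two arrays from Python numbers and inspects their dtype attribute. The variable names are made up for this example.

```python
import numpy as np

# Integers only: NumPy picks an integer type for all elements
ints = np.array([1, 2, 3])
# Mixing integers and a float: every element is stored as a float
mixed = np.array([1, 2.5, 3])

print(ints.dtype)   # an integer type (the exact width depends on the platform)
print(mixed.dtype)  # float64
print(mixed)
```

Because of this, adding a single float to a list of integers changes how the whole array is stored.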
import numpy as np
# np.array converts a list of lists of numbers into a numpy array
a = np.array([[1.5, 0, 1], [0, 2, 1]])
z = np.zeros((3,4))
print(a)
print(z)
print(a.ndim) # number of dimensions
print(a.shape) # number of elements along each dimension
print(a.size) # number of elements overall
print(a.T) # transposing a matrix
a_single = np.array([1, 2, 9])
a_double = np.array([[1, 2, 9]])
print(a_single)
print(a_double)
print(a_single.T)
print(a_double.T)
print(a_single.shape)
print(a_double.shape)
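# Making vectors into 1-by-n 2-dimensional arrays simplifies working with them in the context of matrices.
# A (3, 1) array times a (1, 3) array broadcasts to a 3-by-3 table of pairwise products
print(a_double.T * a_double)
# Transposing a 1-dimensional array changes nothing, so this is just the element-wise square
print(a_single * a_single.T)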
ab = np.random.rand(3, 4)*10
print(ab)
print()
print()
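# Indexing is usually done with tuples: one index or slice per dimension
ab_slice = ab[1:,:3]  # rows from index 1 onward, first 3 columns
print(ab_slice)
indices = ab[0, :] > 4  # boolean array: which entries of row 0 are greater than 4
print(indices)
print(ab[:, indices])  # keep only the columns where the boolean array is True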
NumPy ndarrays support element-wise mathematical operations and usually perform them much faster than equivalent loops over Python lists.
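To make the difference in style concrete, here is a minimal sketch that computes the same element-wise product with plain lists and with arrays. The values and variable names are made up for illustration, and the sketch does not measure speed.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]
quantities = [1, 2, 3]

# With plain lists, element-wise arithmetic needs an explicit loop or comprehension
totals_list = [p * q for p, q in zip(prices, quantities)]

# With arrays, the same arithmetic is written directly on the whole array
prices_arr = np.array(prices)
quantities_arr = np.array(quantities)
totals_arr = prices_arr * quantities_arr

print(totals_list)       # [10.0, 40.0, 90.0]
print(totals_arr)        # [10. 40. 90.]
print(totals_arr.sum())  # 140.0
```

Note that multiplying a list by a number with * repeats the list instead of scaling its elements, which is one more reason to convert numeric data to arrays first.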
There is a module csv that makes it easier to read and write csv files. However, to work with the data itself you have to use standard Python functions.
### Writing a csv file
import csv
data_list = [
[1, 2, 3],
[4, 5, 6]
]
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_list)
### Reading a csv file
import csv
data = []
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row)
        print(row)
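One detail worth knowing when working with the result: csv.reader returns every value as a string, so numbers have to be converted explicitly. The sketch below shows this on a small in-memory stand-in for a file (the text and variable names are made up for the example).

```python
import csv
import io

# A small in-memory stand-in for a csv file (made-up values)
text = "1,2,3\n4,5,6\n"

rows = []
for row in csv.reader(io.StringIO(text)):
    rows.append(row)

print(rows)  # [['1', '2', '3'], ['4', '5', '6']]

# Convert every value from str to float with ordinary Python
numbers = [[float(value) for value in row] for row in rows]
print(numbers)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

The same conversion applies to rows read from a real file opened with open().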
Another common data format is JSON. Its syntax is similar to the Python syntax for lists and dictionaries. You can read and write it with the module json.
### Writing json
import json
data = ['foo', {'bar': ('baz', None, 1.0, 2)}]
with open('some.json', 'w') as f:
    json.dump(data, f)
### Reading json
import json
with open('some.json') as f:
    data = json.load(f)
print(data)
print(data[1])
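JSON does not have every Python type, so a round trip can change the data slightly. The sketch below uses json.dumps and json.loads, which work on strings instead of files, to show what happens to the tuple and to None from the example above. The variable names are chosen for this illustration.

```python
import json

original = ['foo', {'bar': ('baz', None, 1.0, 2)}]

# json.dumps writes to a string instead of a file
text = json.dumps(original)
print(text)  # ["foo", {"bar": ["baz", null, 1.0, 2]}]

restored = json.loads(text)
print(restored)              # ['foo', {'bar': ['baz', None, 1.0, 2]}]
print(restored == original)  # False: the tuple came back as a list
```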
In the file sales.csv (link) there is a fictional set of sales data for employees of some company.
Load this dataset using the module csv. Convert the numerical parts of the data into NumPy arrays: training level, work experience, salary, sales.
Compute a NumPy array for efficiency, defined as sales divided by salary. Use only array operations, no loops.
Use the function np.argmax(a) to find which employees have the highest salary, sales value and efficiency.
Use indexing with logical expressions, such as a[a < 5], to find the salaries of employees whose work experience is greater than 4 years.
Do not change the order of any of your arrays, and make use of that fact: for each employee, the index is the same in every array.
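The idea of parallel arrays can be sketched on made-up numbers. The values below are NOT taken from sales.csv, and the array names are only an assumption about how you might name your own arrays.

```python
import numpy as np

# Made-up values; index i always refers to the same employee in every array
experience = np.array([2, 7, 5, 1])
salary = np.array([1000.0, 2000.0, 1500.0, 800.0])
sales = np.array([5000.0, 6000.0, 9000.0, 3200.0])

# Element-wise division, no loop
efficiency = sales / salary
print(efficiency)  # [5. 3. 6. 4.]

# np.argmax returns the index of the largest element
best = np.argmax(efficiency)
print(best, salary[best], sales[best])  # 2 1500.0 9000.0

# A logical expression on one array selects matching elements of another
print(salary[experience > 4])  # [2000. 1500.]
```

Because the order is never changed, an index or a logical expression computed from one array can be applied to any of the others.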
pandas is a module that allows you to perform many tasks related to data analysis.
The main data structure in pandas is the DataFrame.
A DataFrame is a two-dimensional indexed array of values with a header. Generally, rows are objects and columns are individual properties, with their names in the header.
import pandas as pd
df_ex_no_labels = pd.DataFrame(ab[1,:])
print(df_ex_no_labels)
df_ex = pd.DataFrame(ab[:,:], columns=['Value 1', 'Important number', 'Value 2', 'Another value'])
print(df_ex)
# Usually a data frame is read from a file
# comma separated values
data = '''
Example 1,Example 2,Example3
1,2,3
2,51,35
3,0,0
4,50,25
'''
with open('df.csv', 'w') as df_file:
    df_file.write(data)
df = pd.read_csv("df.csv")
print(df)
df2 = pd.read_csv('https://gist.githubusercontent.com/l8doku/eded81e7f66e504f96117ef90ed28dcf/raw/e286e724378ea8ad47d7ee6ccd245084d7a797cd/example_data.csv')
print(df2)
df2.columns
df2[['Provider', 'Trend_in_One_Month']]
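The column selection above can be tried without downloading anything. The sketch below uses a small made-up table (the column names are placeholders for this example) to show the difference between selecting one column and selecting a list of columns, and how a logical expression selects rows.

```python
import pandas as pd

# Made-up table; the column names are placeholders for this example
df_small = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [3, 8, 5]})

# A single column name returns a Series
scores = df_small['score']
# A list of column names returns a DataFrame
sub = df_small[['name', 'score']]
print(type(scores).__name__)  # Series
print(type(sub).__name__)     # DataFrame

# A logical expression on a column selects rows, as with NumPy arrays
high = df_small[df_small['score'] > 4]
print(high)

# to_numpy() gives the values of a column as a NumPy array
print(scores.to_numpy())  # [3 8 5]
```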