- Find and download a dataset.
- Check descriptive statistics of the dataset: mean, median, standard deviation for some of the fields (at least 3 fields).
- Plot graphs for some of the fields (at least 3 fields).
- Do at least 2 comparisons of values of the fields filtered by some condition (an example of this will be given later).
- Describe in 2-3 paragraphs what kind of dataset it is: what data it contains and what patterns you noticed in it.
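The statistics and comparison steps can be sketched with pandas, which is introduced later in these notes. The table below is made up purely for illustration (the column names `height_cm`, `weight_kg`, `age` and `group` are hypothetical); a real project would load its own downloaded dataset instead.

```python
import pandas as pd

# Hypothetical toy dataset; replace with your own downloaded data.
df = pd.DataFrame({
    "height_cm": [160, 172, 181, 168, 175, 190],
    "weight_kg": [55, 70, 85, 62, 74, 95],
    "age": [23, 35, 41, 29, 52, 33],
    "group": ["a", "b", "b", "a", "b", "a"],
})

fields = ["height_cm", "weight_kg", "age"]

# Descriptive statistics for three fields: one row per statistic.
stats = df[fields].agg(["mean", "median", "std"])
print(stats)

# Comparison of a field filtered by a condition:
# mean weight of people older than 30 versus everyone else.
older = df[df["age"] > 30]
younger = df[df["age"] <= 30]
print(older["weight_kg"].mean(), younger["weight_kg"].mean())
```

This is only one possible shape for such a comparison; any condition that splits the rows into meaningful groups would work the same way.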

Full project instructions will be available soon on the wiki.

https://dataverse.harvard.edu/

https://datasetsearch.research.google.com/

https://github.com/awesomedata/awesome-public-datasets

Other sources. Useful search keywords are "open dataset" or "public dataset".

For the purposes of the project, you should select datasets from areas where "data" is a table of meaningful numbers, instead of something complex like fMRI data, images, eye-tracking data and so on.

NumPy is a third-party module that provides tools for working with matrices and arrays and performing complex computations.

Its documentation can be found here: https://numpy.org/doc/1.21/

The main object of NumPy is an n-dimensional array.

A 1-dimensional array is a vector. It is similar to a Python list, but it can only hold numbers of the same type.

A 2-dimensional array is a matrix.
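A short sketch of what "numbers of the same type" means in practice: when a list mixes integers and floats, NumPy converts everything to one common type. The variable names here are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])    # one float -> every element becomes a float

print(ints.dtype)
print(mixed.dtype)
print(mixed)

# A list of lists gives a 2-dimensional array (a matrix).
matrix = np.array([[1, 2], [3, 4]])
print(ints.ndim, matrix.ndim)
```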

In [ ]:

```
import numpy as np
```

In [ ]:

```
# np.array converts a list of lists of numbers into a numpy array
a = np.array([[1.5, 0, 1], [0, 2, 1]])
z = np.zeros((3,4))
print(a)
print(z)
```

In [ ]:

```
print(a.ndim) # number of dimensions
print(a.shape) # number of elements along each dimension
print(a.size) # number of elements overall
print(a.T) # transposing a matrix
```

In [ ]:

```
a_single = np.array([1, 2, 9])
a_double = np.array([[1, 2, 9]])
print(a_single)
print(a_double)
print(a_single.T)
print(a_double.T)
print(a_single.shape)
print(a_double.shape)
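# Transposing a 1-dimensional array does nothing, so making vectors into
# 1-by-n 2-dimensional arrays simplifies working with them in the context of matrices.
print(a_double.T * a_double)  # (3, 1) * (1, 3) broadcasts to a 3-by-3 table of products
print(a_single * a_single.T)  # element-wise product, still 3 elements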
```

In [ ]:

```
ab = np.random.rand(3, 4)*10
print(ab)
print()
print()
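# Indexing is usually done with tuples: one index or slice per dimension
ab_slice = ab[1:,:3]  # rows from 1 onward, first 3 columns
print(ab_slice)
indices = ab[0, :] > 4  # boolean mask: which values in row 0 exceed 4
print(indices)
print(ab[:, indices])  # keep only the columns where the mask is True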
```

NumPy ndarrays support basic mathematical operations and perform them much faster than Python lists.
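As a small illustration, arithmetic operators on arrays act element by element, whereas the same operators on lists mean something else. A minimal sketch with made-up values:

```python
import numpy as np

prices = [10.0, 20.0, 30.0]
prices_arr = np.array(prices)

# Lists: * repeats the list, so element-wise math needs a loop or comprehension.
doubled_list = [p * 2 for p in prices]

# Arrays: the operation is applied to every element at once.
doubled_arr = prices_arr * 2
shifted = prices_arr + np.array([1.0, 2.0, 3.0])

print(prices * 2)      # list repetition: six elements
print(doubled_list)
print(doubled_arr)
print(shifted)
print(prices_arr.mean(), prices_arr.sum())
```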

The standard library module `csv` makes it easier to read and write CSV files. However, to work with the data itself you have to use standard Python functions.

In [ ]:

```
### Writing a csv file
import csv
data_list = [
    [1, 2, 3],
    [4, 5, 6]
]
# newline='' lets the csv module control line endings itself
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data_list)
```

In [ ]:

```
### Reading a csv file
import csv
data = []
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    # each row is a list of strings
    for row in reader:
        data.append(row)
        print(row)
```
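Note that `csv.reader` returns every field as a string, so numbers have to be converted by hand before doing any arithmetic. A minimal sketch, which rewrites the same `some.csv` so that it runs on its own:

```python
import csv

# Recreate the small file so this cell is self-contained.
with open('some.csv', 'w', newline='') as f:
    csv.writer(f).writerows([[1, 2, 3], [4, 5, 6]])

with open('some.csv', newline='') as f:
    rows = list(csv.reader(f))

print(rows)  # every value is a string

# Convert each field to int with plain Python.
numbers = [[int(value) for value in row] for row in rows]
row_sums = [sum(row) for row in numbers]
print(numbers)
print(row_sums)
```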

Another common data format is JSON. Its syntax is similar to that of Python literals. You can read and write it with the module `json`.

In [ ]:

```
### Writing json
import json
data = ['foo', {'bar': ('baz', None, 1.0, 2)}]
with open('some.json', 'w') as f:
    json.dump(data, f)
```

In [ ]:

```
### Reading json
import json
with open('some.json') as f:
    data = json.load(f)
print(data)
print(data[1])
```
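JSON has fewer types than Python, so a round trip does not always return exactly what was written. The sketch below uses `json.dumps` and `json.loads` (the string-based counterparts of `dump` and `load`) to show that a tuple comes back as a list and that `None` is stored as `null`:

```python
import json

data = ['foo', {'bar': ('baz', None, 1.0, 2)}]

text = json.dumps(data)   # serialize to a string instead of a file
print(text)

restored = json.loads(text)
print(restored)
print(type(restored[1]['bar']))   # a list, not a tuple
```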

The file `sales.csv` (link) contains a fictional set of sales data for the employees of a company.

Load this dataset using the module `csv`. Convert the numerical parts of the data into NumPy arrays: training level, work experience, salary, sales.

Compute a NumPy array `efficiency`, defined as sales divided by salary. Use only array operations, no loops.

Use the function `np.argmax(a)` to find which employees have the highest salary, sales value and efficiency.

Use indexing with logical expressions, such as `a[a < 5]`, to find the salaries of employees whose work experience is greater than 4 years.

Do not change the order of any of your arrays, and rely on that fact: for each employee, the index is the same in every array.
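The NumPy tools named in this exercise can be sketched on unrelated, made-up data (cities with hypothetical temperature and rainfall values), so the exercise itself is still left to solve. The key point is that one index refers to the same object in every array:

```python
import numpy as np

# Hypothetical parallel arrays: position i describes the same city everywhere.
cities = np.array(['A', 'B', 'C', 'D'])
temperature = np.array([12.0, 25.0, 19.0, 30.0])
rainfall = np.array([80.0, 10.0, 40.0, 5.0])

# Element-wise division, no loop.
ratio = rainfall / temperature

# np.argmax returns the index of the largest value
# (the first such index if there are ties).
hottest = np.argmax(temperature)
print(cities[hottest])

# A boolean mask built from one array can index another array.
dry = rainfall < 20
print(temperature[dry])
print(cities[dry])
```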

`pandas` is a module that allows you to perform many tasks related to data analysis.

The main data structure in pandas is the DataFrame.

A DataFrame is a two-dimensional indexed array of values with a header. Generally, rows are objects and columns are individual properties, with their names in the header.
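A minimal sketch of that layout, built from a dictionary of made-up values (the column names are hypothetical): each key becomes a column name in the header, and each row gets an index label.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Eve'],
    'age': [31, 45, 27],
    'height_cm': [165.0, 180.5, 172.0],
})

print(people)
print(people.columns.tolist())   # the header
print(people.index.tolist())     # default row index: 0, 1, 2
print(people.shape)              # (rows, columns)
print(people.loc[1])             # one row: all properties of one object
print(people['age'])             # one column: one property of all objects
```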

In [ ]:

```
import pandas as pd
```

In [ ]:

```
df_ex_no_labels = pd.DataFrame(ab[1,:])
print(df_ex_no_labels)
```

In [ ]:

```
df_ex = pd.DataFrame(ab[:,:], columns=['Value 1', 'Important number', 'Value 2', 'Another value'])
print(df_ex)
```

In [ ]:

```
# Usually a data frame is read from a file
# comma separated values
data = '''
Example 1,Example 2,Example3
1,2,3
2,51,35
3,0,0
4,50,25
'''
with open('df.csv', 'w') as df_file:
    df_file.write(data)
df = pd.read_csv("df.csv")
print(df)
```

In [ ]:

```
df2 = pd.read_csv('https://gist.githubusercontent.com/l8doku/eded81e7f66e504f96117ef90ed28dcf/raw/e286e724378ea8ad47d7ee6ccd245084d7a797cd/example_data.csv')
print(df2)
```

In [ ]:

```
df2.columns
```

In [ ]:

```
df2[['Provider', 'Trend_in_One_Month']]
```
