Workshop 23. Data analysis examples. Visualization.

In [ ]:
import pandas as pd
data = pd.read_csv('https://gist.githubusercontent.com/l8doku/f291d09e88c866d3a044212b45cb5e23/raw/5d7e7aa3ca4fd03ab3c051b4adc304c4ceb01e81/titanic_train.csv')
# Titanic passenger data (training set)
In [ ]:
data.head(10)
Out[ ]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
5  6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.4583   NaN    Q
6  7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
7  8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742            11.1333  NaN    S
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1      0      237736            30.0708  NaN    C
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [ ]:
data.describe()
Out[ ]:
       PassengerId  Survived     Pclass       Age          SibSp        Parch        Fare
count  891.000000   891.000000   891.000000   714.000000   891.000000   891.000000   891.000000
mean   446.000000   0.383838     2.308642     29.699118    0.523008     0.381594     32.204208
std    257.353842   0.486592     0.836071     14.526497    1.102743     0.806057     49.693429
min    1.000000     0.000000     1.000000     0.420000     0.000000     0.000000     0.000000
25%    223.500000   0.000000     2.000000     20.125000    0.000000     0.000000     7.910400
50%    446.000000   0.000000     3.000000     28.000000    0.000000     0.000000     14.454200
75%    668.500000   1.000000     3.000000     38.000000    1.000000     0.000000     31.000000
max    891.000000   1.000000     3.000000     80.000000    8.000000     6.000000     512.329200

Tasks and examples

Change the code in the corresponding cells or write your own.

1. How many men / women were on board?

In [ ]:
# change this code
data['Age'].value_counts()
Out[ ]:
24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: Age, Length: 88, dtype: int64
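
As a hint for task 1, here is a minimal sketch of how `value_counts()` behaves on a categorical column. The `toy` frame is made up for illustration and is not the Titanic data; the variable names are assumptions as well.

```python
import pandas as pd

# Made-up rows for illustration only; not the Titanic data.
toy = pd.DataFrame({'Sex': ['male', 'female', 'male', 'male', 'female']})

# value_counts() tallies each distinct value, most frequent first.
sex_counts = toy['Sex'].value_counts()
print(sex_counts)

# normalize=True returns shares instead of absolute counts.
sex_shares = toy['Sex'].value_counts(normalize=True)
print(sex_shares)
```

The same call on `data['Sex']` should give the counts the task asks for.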

2. Print the distribution of Pclass (socioeconomic status). Additionally, print the same distribution for men and women separately.

In [ ]:
# change this code
# see what happens with margins=True
pd.crosstab(data['SibSp'], data['Survived'], margins=False)
Out[ ]:
Survived    0    1
SibSp
0         398  210
1          97  112
2          15   13
3          12    4
4          15    3
5           5    0
8           7    0
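
One possible way to approach task 2 is sketched below on a small made-up frame (the `toy` data and variable names are assumptions, not part of the workshop data). It shows what `margins=True` adds to a cross-tabulation.

```python
import pandas as pd

# Made-up rows for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Pclass': [1, 3, 3, 2, 3, 1],
    'Sex': ['female', 'male', 'male', 'female', 'female', 'male'],
})

# Distribution of a single column, ordered by class number.
class_counts = toy['Pclass'].value_counts().sort_index()
print(class_counts)

# Rows are Pclass values, columns are Sex values.
# margins=True appends an 'All' row and an 'All' column holding the totals.
table = pd.crosstab(toy['Pclass'], toy['Sex'], margins=True)
print(table)
```

With `margins=False` the `All` row and column are simply omitted.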

3. How much did a passenger pay on average? Also find the standard deviation of the fare. Round the results to 2 decimal places.

In [ ]:
mean_fare = # your code, use the .mean() method
# std_fare = # same kind of method
# median_fare = # same kind of method
print(f'{mean_fare:.2f}')  # .2f prints two digits after the decimal point
# other prints
1.11
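
The format specifier matters for the rounding in task 3. The sketch below uses a made-up series of fares (an assumption, not the real column) to contrast `:.3`, which keeps three significant digits, with `:.2f`, which keeps two digits after the decimal point.

```python
import pandas as pd

# Made-up fares for illustration only.
fares = pd.Series([7.25, 71.2833, 8.05, 53.1])

mean_fare = fares.mean()
std_fare = fares.std()        # sample standard deviation (ddof=1) by default
median_fare = fares.median()

print(f'{mean_fare:.3}')      # three significant digits: 34.9
print(f'{mean_fare:.2f}')     # two decimal places: 34.92
print(f'{std_fare:.2f}')
print(f'{median_fare:.2f}')
```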

4. Is it true that young people were more likely to survive than old people? Let "young" mean those under 30 and "old" those over 60.

In [ ]:
# use the same boolean indexing as in previous workshops to select the young and the old
young = data[data['Age'] < 30]
# then take .mean() of the "Survived" column to get the survival rate of each group
# use print(f'{variable:.2f}') to print two decimal places
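
The filtering steps described in the comments can be sketched as follows. The `toy` frame is made up (an assumption for illustration); note that rows with a missing age fail both comparisons and so land in neither group.

```python
import pandas as pd

# Made-up rows for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Age': [22.0, 65.0, 4.0, None, 71.0, 28.0],
    'Survived': [1, 0, 1, 1, 0, 0],
})

# Boolean masks select the two age groups; NaN ages match neither mask.
young = toy[toy['Age'] < 30]
old = toy[toy['Age'] > 60]

# Survived holds 0/1, so its mean is the share of survivors in the group.
young_rate = young['Survived'].mean()
old_rate = old['Survived'].mean()
print(f'young: {young_rate:.2f}, old: {old_rate:.2f}')
```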

5. Is it true that women were more likely to survive than men?

In [ ]:
# use the same logic as in task 4
In [ ]:
# Name has the structure
# last_name, Title. first_name (other names)
# so splitting on whitespace puts the first name at index 2
# (this assumes the surname itself contains no spaces)
def get_first_name(full_name):
    first_name = full_name.split()[2]
    return first_name

# use this column of data to solve the task
data['Name'].apply(get_first_name)
Out[ ]:
0           Owen
1           John
2          Laina
3        Jacques
4        William
         ...    
886       Juozas
887     Margaret
888    Catherine
889         Karl
890      Patrick
Name: Name, Length: 891, dtype: object

6.1. Is it true that people from third class (Pclass=3) have shorter surnames than people from other classes?

In [ ]:
# use logic similar to task 6
# create a new column with the surname length
# then filter passengers by class and find the average surname length for each class
# you may use a lambda function instead of a regular one
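
A minimal sketch of the surname-length idea, using four rows copied from the `head()` output above. The column name `SurnameLen` is hypothetical, and `groupby` is used here in place of the per-class filtering the comments describe.

```python
import pandas as pd

# Four rows copied from the head() output; a tiny sample, not the full data.
toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina',
             'Allen, Mr. William Henry', 'McCarthy, Mr. Timothy J'],
    'Pclass': [3, 3, 3, 1],
})

# The surname is everything before the first comma.
toy['SurnameLen'] = toy['Name'].apply(lambda name: len(name.split(',')[0]))

# Average surname length per class (groupby replaces manual filtering).
mean_len_by_class = toy.groupby('Pclass')['SurnameLen'].mean()
print(mean_len_by_class)
```

Splitting on the comma rather than on whitespace keeps surnames that contain spaces intact.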

7. Compare the distribution of ticket prices for the rescued and the victims.

In [ ]:
import matplotlib.pyplot as plt

# your code to select the Fare column of rescued passengers (store it in `rescued`)
# your code to select the Fare column of victims
rescued.hist(color="green", label='Survived')
# histogram for victims
plt.title('Ticket fares of survivors and victims')
plt.xlabel('Pounds')
plt.ylabel('Frequency')
plt.legend();
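
For comparison, here is a sketch of two overlaid histograms drawn with matplotlib directly. The fares are made up, and the shared `range`, the `alpha` transparency and the label `'Died'` are illustrative choices rather than part of the task.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0],
})

# Select the Fare column for each outcome.
rescued = toy.loc[toy['Survived'] == 1, 'Fare']
victims = toy.loc[toy['Survived'] == 0, 'Fare']

# A shared range makes the bins of both histograms line up.
fare_range = (0, toy['Fare'].max())

fig, ax = plt.subplots()
ax.hist(rescued, bins=5, range=fare_range, alpha=0.5, color='green', label='Survived')
ax.hist(victims, bins=5, range=fare_range, alpha=0.5, color='red', label='Died')
ax.set_title('Ticket fares of survivors and victims')
ax.set_xlabel('Pounds')
ax.set_ylabel('Frequency')
ax.legend()
# In a notebook the figure is rendered when the cell finishes running.
```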

8*. How does the average age of men / women differ by class of service?

In [ ]:
# one way to solve:
# get the distinct values with data['Pclass'].unique() and data['Sex'].unique()
# iterate over every combination of these values
# filter the data by the current combination
# compute the average age

# another way to solve: groupby
# for (pclass, sex), filtered_dt in data.groupby(['Pclass', 'Sex']):
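
The `groupby` route can be sketched like this on a made-up frame (the `toy` data and variable names are assumptions). The loop form and the aggregated form should produce the same averages; `.mean()` skips missing ages.

```python
import pandas as pd

# Made-up rows for illustration only; not the Titanic data.
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3, 1],
    'Sex': ['female', 'male', 'male', 'male', 'female', 'female'],
    'Age': [38.0, 54.0, 22.0, None, 26.0, 35.0],
})

# Loop form: each iteration yields the group key and the matching rows.
for (pclass, sex), group in toy.groupby(['Pclass', 'Sex']):
    print(pclass, sex, f"{group['Age'].mean():.2f}")

# Aggregated form: one Series indexed by (Pclass, Sex).
mean_age = toy.groupby(['Pclass', 'Sex'])['Age'].mean()

# unstack() turns the Sex level into columns for a compact table.
print(mean_age.unstack())
```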

Visualization. Seaborn

To install seaborn from the IDE settings: File > Settings (Preferences) > Project: [Project-name] > Python Interpreter > + > enter "seaborn" > Install Package

or

In [ ]:
!pip install seaborn
In [ ]:
import seaborn as sns

9. Plot pairwise dependencies of the features Age, Fare, SibSp, Parch, Embarked and Survived.

In [ ]:
# pairplot draws only numeric columns by default, so Embarked (text) is left out of the grid
sns.pairplot(data[['Survived', 'Age', 'Fare', 'SibSp',
                   'Parch', 'Embarked']]);

10. How does the ticket price (Fare) depend on the cabin class (Pclass)? Plot boxplot and catplot.

In [ ]:
sns.boxplot(x='Pclass', y='Fare', data=data)
Out[ ]:
<AxesSubplot:xlabel='Pclass', ylabel='Fare'>
In [ ]:
sns.catplot(x='Pclass', y='Fare', data=data)
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7f59baf70550>
In [ ]:
# s sets the marker size
sns.catplot(x='Pclass', y='Fare', data=data, kind='swarm', s=2)
/usr/lib/python3.9/site-packages/seaborn/categorical.py:1296: UserWarning: 25.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/usr/lib/python3.9/site-packages/seaborn/categorical.py:1296: UserWarning: 58.9% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7f59b95a1430>

10.1 How does the ticket price (Fare) depend on the number of siblings and spouses (SibSp)? Plot boxplot and catplot.

In [ ]:
# your code

10.2*. Plot the same graph as in task 10 but remove the outliers, i.e. the Fare values that are too far from the average.

In [ ]:
# the algorithm to remove outliers is the following:
# compute average value .mean()
# compute standard deviation: .std()
# filter data by the following criteria:
#    if the data is too far away from average (more than 2 standard deviations away), discard it
#    otherwise, keep it
# do this for all classes separately
# plot filtered data
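
The algorithm in the comments above can be sketched with `groupby(...).transform(...)`, which computes the per-class mean and standard deviation and aligns them with the original rows. The `toy` frame and all variable names are assumptions for illustration. This is a single pass, so the mean and standard deviation are themselves computed with the outliers included.

```python
import pandas as pd

# Made-up fares with one obvious outlier per class; not the Titanic data.
toy = pd.DataFrame({
    'Pclass': [1] * 8 + [3] * 8,
    'Fare': [50.0, 52.0, 51.0, 53.0, 49.0, 50.0, 52.0, 500.0,
             7.0, 8.0, 7.5, 8.5, 7.25, 7.75, 8.25, 70.0],
})

# transform() returns one value per row: the statistic of that row's class.
grouped = toy.groupby('Pclass')['Fare']
class_mean = grouped.transform('mean')
class_std = grouped.transform('std')

# Keep a row only if its fare is within 2 standard deviations of its class mean.
keep = (toy['Fare'] - class_mean).abs() <= 2 * class_std
filtered = toy[keep]
print(filtered)
```

The filtered frame can then be passed to the plotting call in place of `data`, for example `sns.boxplot(x='Pclass', y='Fare', data=filtered)`.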