Workshop 23. Data analysis examples. Visualization.¶

import pandas as pd
data = pd.read_csv('https://gist.githubusercontent.com/l8doku/f291d09e88c866d3a044212b45cb5e23/raw/5d7e7aa3ca4fd03ab3c051b4adc304c4ceb01e81/titanic_train.csv')
# Titanic passenger data (training set)

data.head(10)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

data.describe()

Tasks and examples¶

Change the code in the corresponding cells or write your own

1. How many men / women were on board?¶

# change this code
data['Age'].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: Age, Length: 88, dtype: int64
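For task 1, the same `value_counts()` call can be pointed at the `Sex` column instead of `Age`. A minimal sketch, assuming a small stand-in frame `sample` (a name introduced here) that holds the `Sex` values of the first ten rows shown by `data.head(10)`; on the full dataset, replace `sample` with `data`.

```python
import pandas as pd

# stand-in for `data`: Sex values of the first ten passengers
sample = pd.DataFrame({'Sex': ['male', 'female', 'female', 'female', 'male',
                               'male', 'male', 'male', 'female', 'female']})

# count how many rows fall into each category of the column
sex_counts = sample['Sex'].value_counts()
print(sex_counts)
```

Passing `normalize=True` to `value_counts()` returns shares instead of counts.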

2. Print the distribution of Pclass (socioeconomic status). Additionally, print the same distribution separately for men and women.¶

# change this code
# see what happens with margins=True
pd.crosstab(data['SibSp'], data['Survived'], margins=False)
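One possible approach to task 2 is to use `value_counts()` for the overall distribution and `pd.crosstab` for the split by sex. This is a sketch under the assumption that a ten-row stand-in frame `sample` (built from the `Pclass` and `Sex` values of the rows shown by `data.head(10)`) is used in place of `data`.

```python
import pandas as pd

# stand-in for `data`: Pclass and Sex of the first ten passengers
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female'],
})

# overall distribution of passenger classes
class_counts = sample['Pclass'].value_counts()
print(class_counts)

# class distribution split by sex; margins=True adds an "All" row and column
table = pd.crosstab(sample['Pclass'], sample['Sex'], margins=True)
print(table)
```

With `margins=True` the extra `All` row and column hold the totals, which makes it easy to check that the per-sex counts add up to the overall distribution.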

3. How much did one passenger pay on average? Find the standard deviation of this value. Better to round to 2 decimal places.¶

mean_fare = data['Fare'].mean()  # method .mean()
# std_fare = # same type of method
# median_fare = # same type of method
# .2f rounds to 2 decimal places (.3 would give 3 significant digits)
print(f'{mean_fare:.2f}')
# other prints

32.20

4. Is it true that young people were more likely to survive than old people? Let "young" mean those under 30 and "old" those over 60.¶

# use the same boolean indexing as in previous workshops to filter old and young
young = data[data['Age'] < 30]
# then use .mean() over the "Survived" column to get the average chance of survival.
# use print(f'{variable:.2f}') to output exactly two decimal places
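The hints above can be sketched as follows. The frame `sample` is a stand-in built from the `Age` and `Survived` values of the ten rows shown by `data.head(10)`, and the names `young_rate` and `old_rate` are introduced here for illustration. None of these ten passengers is over 60, so the "old" group is empty and its mean is `NaN`; on the full dataset both groups are non-empty.

```python
import pandas as pd

# stand-in for `data`: Age and Survived of the first ten passengers
sample = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, float('nan'), 54.0, 2.0, 27.0, 14.0],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# boolean indexing; rows with a missing Age fail both comparisons
young = sample[sample['Age'] < 30]
old = sample[sample['Age'] > 60]

# Survived is 0/1, so its mean is the share of survivors in the group
young_rate = young['Survived'].mean()
old_rate = old['Survived'].mean()

print(f'{young_rate:.2f}')
print(f'{old_rate:.2f}')
```

Because comparisons with `NaN` are always false, passengers with an unknown age end up in neither group.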

5. Is it true that women were more likely to survive than men?¶

# use the same logic as in task 4
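As an alternative to filtering each group by hand, `groupby` computes the survival share for every value of `Sex` in one step. A minimal sketch on a ten-row stand-in frame `sample` taken from the rows shown by `data.head(10)`; the sample is far too small to draw conclusions from, so run the same lines on `data` for the actual answer.

```python
import pandas as pd

# stand-in for `data`: Sex and Survived of the first ten passengers
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# share of survivors within each sex
survival_by_sex = sample.groupby('Sex')['Survived'].mean()
print(survival_by_sex)
```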

6. What's the most popular name among the male passengers on the Titanic?¶

# name has structure
# last_name, Mr. first_name (middle_name)
# three parts separated by spaces
def get_first_name(full_name):
    first_name = full_name.split()[2]
    return first_name
    
# use this column of data to solve the task
data['Name'].apply(get_first_name)

0           Owen
1           John
2          Laina
3        Jacques
4        William
         ...    
886       Juozas
887     Margaret
888    Catherine
889         Karl
890      Patrick
Name: Name, Length: 891, dtype: object
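To answer task 6, one option is to keep only the male passengers before applying `get_first_name`, then count the resulting names. The sketch below assumes a five-row stand-in frame `sample` with names copied from the rows shown by `data.head(10)`; `male_names` and `name_counts` are names introduced here. Note that `split()[2]` relies on the surname being a single word, which may not hold for every row of the full dataset.

```python
import pandas as pd

def get_first_name(full_name):
    # "last_name, Mr. first_name ..." -> third whitespace-separated token
    return full_name.split()[2]

# stand-in for `data`: a few names from the first rows of the dataset
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina',
             'Allen, Mr. William Henry', 'Moran, Mr. James',
             'McCarthy, Mr. Timothy J'],
    'Sex': ['male', 'female', 'male', 'male', 'male'],
})

# first names of male passengers only
male_names = sample.loc[sample['Sex'] == 'male', 'Name'].apply(get_first_name)

# value_counts() sorts by frequency, so the first entry is the most common name
name_counts = male_names.value_counts()
print(name_counts.head())
```

In this tiny sample every name occurs once, so there is no single winner; on the full data `name_counts.idxmax()` returns the most frequent name.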

6.1. Is it true that people from third class (Pclass=3) have shorter surnames than people from other classes?¶

# use logic similar to task 6
# create a new column with surname length.
# Then filter passengers by class and find the average surname length of each class
# you may use a lambda function instead of a regular one
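A possible sketch of these steps, using `groupby` in place of manual filtering by class. The frame `sample` is a stand-in with the `Pclass` and `Name` values of the ten rows shown by `data.head(10)` (one name is truncated in that output and is kept as displayed), and the column name `SurnameLength` is an assumption made for this example.

```python
import pandas as pd

# stand-in for `data`: Pclass and Name of the first ten passengers
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Th...',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry',
             'Moran, Mr. James',
             'McCarthy, Mr. Timothy J',
             'Palsson, Master. Gosta Leonard',
             'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
             'Nasser, Mrs. Nicholas (Adele Achem)'],
})

# the surname is everything before the first comma
sample['SurnameLength'] = sample['Name'].apply(lambda name: len(name.split(',')[0]))

# average surname length for each passenger class
mean_length = sample.groupby('Pclass')['SurnameLength'].mean()
print(mean_length)
```

Splitting on the comma rather than on spaces keeps multi-word surnames intact.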

7. Compare the distribution of ticket prices for the rescued and the victims.¶

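import matplotlib.pyplot as plt

# fares of passengers who survived
rescued = data.loc[data['Survived'] == 1, 'Fare']
# fares of passengers who died
victims = data.loc[data['Survived'] == 0, 'Fare']
# alpha < 1 keeps both overlapping histograms visible
rescued.hist(color="green", label='Survived', alpha=0.5)
victims.hist(color="red", label='Died', alpha=0.5)
plt.title('Ticket fare for survived passengers and victims')
plt.xlabel('Pounds')
plt.ylabel('Frequency')
plt.legend();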

8*. How does the average age of men / women differ by class of service?¶

# one way to solve:
# get different values for sex/class with data['Pclass'].unique()
# iterate over the unique values of each column
# filter the data by the values of the current iteration
# compute average age

# another way to solve: groupby
# for (pclass, sex), filtered_dt in data.groupby(['Pclass', 'Sex']):
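The `groupby` route hinted at above might look like the sketch below. It assumes a ten-row stand-in frame `sample` with values taken from the rows shown by `data.head(10)`; `mean_age` is a name introduced here.

```python
import pandas as pd

# stand-in for `data`: Pclass, Sex and Age of the first ten passengers
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, float('nan'), 54.0, 2.0, 27.0, 14.0],
})

# average age for every (class, sex) pair; missing ages are skipped by .mean()
mean_age = sample.groupby(['Pclass', 'Sex'])['Age'].mean()

for (pclass, sex), age in mean_age.items():
    print(f'class {pclass}, {sex}: {age:.2f}')
```

Calling `mean_age.unstack()` turns the result into a table with classes as rows and sexes as columns, which can be easier to read.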

Visualization. Seaborn¶

File - Settings (Preferences) - Project: [Project-name] - Python Interpreter - + - enter "seaborn" - Install Package

or

!pip install seaborn

import seaborn as sns

9. Plot pairwise dependencies of features `Age`, `Fare`, `SibSp`, `Parch`, `Embarked` and `Survived`.¶

sns.pairplot(data[['Survived', 'Age', 'Fare',  'SibSp', 
                       'Parch', 'Embarked']]);

10. How does the ticket price (`Fare`) depend on the cabin class (`Pclass`)? Plot boxplot and catplot.¶

sns.boxplot(x='Pclass', y='Fare', data=data)

<AxesSubplot:xlabel='Pclass', ylabel='Fare'>

sns.catplot(x='Pclass', y='Fare', data=data)

<seaborn.axisgrid.FacetGrid at 0x7f59baf70550>

# s means marker size
sns.catplot(x='Pclass', y='Fare', data=data, kind='swarm', s=2)

/usr/lib/python3.9/site-packages/seaborn/categorical.py:1296: UserWarning: 25.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/usr/lib/python3.9/site-packages/seaborn/categorical.py:1296: UserWarning: 58.9% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

<seaborn.axisgrid.FacetGrid at 0x7f59b95a1430>

10.1 How does the ticket price (`Fare`) depend on the number of siblings and spouses (`SibSp`)? Plot boxplot and catplot.¶

# your code

10.2*. Plot the same graph as in task 10 but remove the outliers - the values of `Fare` that are too far away from the average.¶

# the algorithm to remove outliers is the following:
# compute average value .mean()
# compute standard deviation: .std()
# filter data by the following criteria:
#    if the data is too far away from average (more than 2 standard deviations away), discard it
#    otherwise, keep it
# do this for all classes separately
# plot filtered data
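The filtering algorithm above can be sketched with `groupby(...).transform(...)`, which returns the per-class mean and standard deviation aligned with the original rows. The helper `remove_fare_outliers`, its parameter `n_std`, and the `toy` frame with made-up fares are all assumptions introduced for this example; they are not real Titanic rows.

```python
import pandas as pd

def remove_fare_outliers(df, n_std=2):
    """Keep rows whose Fare is within n_std standard deviations
    of the mean Fare of their own Pclass."""
    grouped = df.groupby('Pclass')['Fare']
    class_mean = grouped.transform('mean')  # per-class mean, one value per row
    class_std = grouped.transform('std')    # per-class standard deviation
    keep = (df['Fare'] - class_mean).abs() <= n_std * class_std
    return df[keep]

# made-up fares: class 1 contains one obvious outlier (500)
toy = pd.DataFrame({
    'Pclass': [1] * 8 + [3] * 4,
    'Fare': [50, 52, 51, 49, 50, 53, 48, 500, 7, 8, 7.5, 8.5],
})

filtered = remove_fare_outliers(toy)
print(len(toy), len(filtered))
```

The filtered frame can then be passed to the plotting call from task 10, for example `sns.boxplot(x='Pclass', y='Fare', data=remove_fare_outliers(data))`. A class with a single passenger has an undefined standard deviation, so this sketch drops such a row.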

Output of data.head(10):
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Output of data.describe():
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Output of pd.crosstab(data['SibSp'], data['Survived'], margins=False):
Survived	0	1
SibSp
0	398	210
1	97	112
2	15	13
3	12	4
4	15	3
5	5	0
8	7	0
