Workshop 20

File system. Practice

Python provides functions for working with the file system in the os module. These functions let you automate many tasks that would otherwise be tedious to do by hand.

In [ ]:
import os

Iterate through files in the specified directory: os.scandir(), os.listdir()

https://docs.python.org/3/library/os.html#os.scandir

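When you iterate over os.scandir(), the objects inside the loop are special os.DirEntry objects. os.listdir() returns plain strings with the entry names instead.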

https://docs.python.org/3/library/os.html#os.DirEntry

They have the attributes .name and .path and the methods .is_dir() and .is_file() to make working with them easier. They can also be passed directly to open(file, mode).

In [ ]:
for file in os.scandir():
    print(file)
    
for file in os.listdir():
    print(file)
    
for file in os.listdir('data_folder'):
    print(file)
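
As a small illustration of the os.DirEntry attributes described above, here is a sketch that builds a throwaway directory with tempfile and prints what each entry reports. The names demo_dir, subfolder and notes.txt are made up for this example and are not part of the workshop files.

```python
import os
import tempfile

# Build a temporary directory so the example does not depend on your files.
# All names below are hypothetical, chosen only for this illustration.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, 'subfolder'))
    with open(os.path.join(demo_dir, 'notes.txt'), 'w') as f:
        f.write('hello')

    entries = {}
    # os.scandir() can be used as a context manager so it is closed promptly
    with os.scandir(demo_dir) as it:
        for entry in it:
            kind = 'directory' if entry.is_dir() else 'file'
            entries[entry.name] = kind
            print(f'{entry.name}: {kind}, path = {entry.path}')

    # os.listdir() yields plain strings (names only), not DirEntry objects
    names = sorted(os.listdir(demo_dir))
    print(names)
```

Because os.listdir() gives only names, you would need os.path.join(folder, name) to build a full path, while a DirEntry already carries it in .path.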

Check whether a path points to a file or a directory: os.path.isfile(path) and os.path.isdir(path).

These functions work with both strings and os.DirEntry objects.

In [ ]:
for file in os.scandir():
    if os.path.isfile(file):
        print(f'{file.name} is a file')
        
    if os.path.isdir(file):
        print(f'{file.name} is a directory')

Get file size and file parameters

os.path.getsize(path) returns the size of the file in bytes. It may not be the same as the number of characters (some characters take more than 1 byte to encode).
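
To see the difference between bytes and characters, here is a minimal sketch that writes a short string with two non-ASCII letters to a temporary file. It assumes UTF-8 encoding, and the file name is generated by tempfile, so nothing here refers to the workshop files.

```python
import os
import tempfile

# 'naïve café' written with escapes: 10 characters, two of them non-ASCII
text = 'na\u00efve caf\u00e9'

# Write the text as UTF-8 into a temporary file (assumption: UTF-8 encoding)
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as f:
    f.write(text)
    tmp_path = f.name

n_chars = len(text)                  # number of characters in the string
n_bytes = os.path.getsize(tmp_path)  # number of bytes on disk
print(f'{n_chars} characters, {n_bytes} bytes')

os.remove(tmp_path)  # clean up the temporary file
```

In UTF-8 each of the two accented letters takes 2 bytes, so the file is 2 bytes longer than the string.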

In [ ]:
path = './text.txt'
print(f'size of {path} is {os.path.getsize(path)} bytes')
print()
In [ ]:
for file in os.scandir():
    if os.path.isfile(file):
        print(f'size of {file.name} is {os.path.getsize(file)} bytes')
        with open(file, 'r') as f:
            try:
                contents = f.read()
                print(f'length of string of all contents of the file is {len(contents)} characters')
            except UnicodeDecodeError as e:
                print("Can't decode Unicode in the file")
        print()

Task 20.1

Print names of all files in the current folder with size larger than 100 bytes.

Task 20.2

Print names of all files in the current folder which contain the word "print".

Additional part: print a line that contains the word "print" after the name of the file. If several lines contain this word, print any one of them.

Task 20.3

Print a list of all files in the current folder and its subfolders.
The lists below are formatted for illustration; you can just print a list of names.

You can get the name of the current folder with os.getcwd(). Use os.scandir(subfolder) or os.listdir(subfolder) to get lists of files in folders other than the current one.

You can use the line print(os.path.split(os.getcwd())[-1] + '/') to print the current directory without the full path.
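
The path helpers mentioned above can be tried out as follows. This is only a sketch: the subfolder name 'datafolder' is borrowed from the example structure in this task and does not have to exist for the code to run.

```python
import os

cwd = os.getcwd()

# os.path.split() returns a (head, tail) pair; tail is the last path component
head, tail = os.path.split(cwd)
print(tail + '/')

# os.path.basename() returns the same tail directly
print(os.path.basename(cwd) + '/')

# os.path.join() builds a path to a subfolder with the correct separator.
# 'datafolder' is an assumed name; the path is only constructed, not opened.
subfolder = os.path.join(cwd, 'datafolder')
print(subfolder)
```

A path built this way can be passed to os.scandir() or os.listdir() to look inside that subfolder.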

In the examples below, this is the actual structure of files:

folder/
|-data.txt
|-Workshop 20. File system. Practice.ipynb
|-my_file2.txt
|-datafolder/
  |-data.txt
  |-moredata.txt
  |-another_folder/
    |-data.csv
    |-folder/
      |-hidden_data.json
|-more_folders/
  |-Workshop 21.ipynb

Version 1: 2 levels. Only the folder and the folders one level down. No formatting. Order doesn't matter. Add / to names of folders.

folder/
data.txt
Workshop 20. File system. Practice.ipynb
my_file2.txt
datafolder/
data.txt
moredata.txt
another_folder/
more_folders/
Workshop 21.ipynb

Version 2: 2 levels. Formatting as above. Order of files in a folder doesn't matter.

folder/
|-data.txt
|-Workshop 20. File system. Practice.ipynb
|-my_file2.txt
|-datafolder/
  |-data.txt
  |-moredata.txt
  |-another_folder/
|-more_folders/
  |-Workshop 21.ipynb

Version 3: all levels. Formatted as the example below. Order of files in a folder doesn't matter.

folder/
|-data.txt
|-Workshop 20. File system. Practice.ipynb
|-my_file2.txt
|-datafolder/
  |-data.txt
  |-moredata.txt
  |-another_folder/
    |-data.csv
    |-folder/
      |-hidden_data.json
|-more_folders/
  |-Workshop 21.ipynb

Task 19.2. Coin flips

Write a program that simulates coin flips by generating random numbers. Generate a sequence of coin flips until you get either heads or tails 3 times in a row.

Run the simulation 15 times. Compute the average number of flips it takes to get the same result 3 times in a row.
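
One possible way to turn a random number into a single coin flip is sketched below. The helper name flip is made up for this illustration, and the sketch deliberately stops short of the full simulation.

```python
import random

def flip():
    """Return 'H' or 'T' with equal probability (hypothetical helper)."""
    # random.randint(0, 1) returns 0 or 1; map 1 to heads and 0 to tails
    return 'H' if random.randint(0, 1) == 1 else 'T'

# Generate a few flips and print them in the style of the example output
sample = [flip() for _ in range(5)]
print(' '.join(sample))
```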

Example output
H T H T T T (6 flips)
H H T H H T H H H (9 flips)
T T T (3 flips)
...
T H H H (4 flips)
Average: 8.1 flips

Task 20.4. Data file

The file sales.csv contains a fictional set of sales data for the employees of a company.

The structure of the file is the following:

division,level of education,training level,work experience,salary,sales
computer hardware,some college,1,5,81769,302611
peripherals,bachelor's degree,1,4,89792,274336
peripherals,high school,0,5,70797,256854

The first row is a header; all remaining rows contain data.

Your task is the following:

  1. Read the file and save all data into a matrix (list of lists).
    • Read a line, split with .split(','), save the result into a list.
    • When saving, convert salary and sales data to integers.
  2. Output the data row for the employee with the highest sales number.
  3. Compute and print the average salary of the employees.
  4. Print 5 data lines corresponding to employees with the highest sales/salary ratio.
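
Steps 1 and 2 can be sketched on the sample rows shown above, kept in a list of strings instead of being read from sales.csv. This is one possible approach under that assumption, not the only way to do it; with the real file you would iterate over the lines of the opened file instead.

```python
# Sample lines copied from the file description above (header + 3 data rows)
lines = [
    "division,level of education,training level,work experience,salary,sales",
    "computer hardware,some college,1,5,81769,302611",
    "peripherals,bachelor's degree,1,4,89792,274336",
    "peripherals,high school,0,5,70797,256854",
]

header = lines[0].strip().split(',')
data = []
for line in lines[1:]:
    row = line.strip().split(',')
    # salary and sales are the last two columns; convert them to integers
    row[-2] = int(row[-2])
    row[-1] = int(row[-1])
    data.append(row)

# max() with a key function picks the row with the largest sales value
best = max(data, key=lambda row: row[-1])
print(best)
```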

Task 20.5. Frequency analysis (with punctuation)

Write a function word_count() that takes a string as input and returns a dictionary where the keys are words and the values are the number of times each word appears.

In this task, a word is always one of

  1. A sequence of ASCII letters in a row ("the", "I") - a simple word
  2. Two simple words joined by a single apostrophe character ("don't", "I'm")

Additional rules and assumptions:

  1. Words in the dictionary should be lowercase. "The" and "the" should result in a dictionary {'the': 2}.
  2. All punctuation is ignored except the apostrophe in a contraction.
  3. Words don't have to be separated by whitespace ("The_spacebar_is_broken" should result in a dictionary {'the': 1, 'spacebar': 1, 'is': 1, 'broken': 1}).

You can use the string module to get all punctuation characters.

In [ ]:
import string
print(string.punctuation)
In [ ]:
import string

def word_count(sentence):
    word_dict = {}
    return word_dict
In [ ]:
# Tests

def check(output, answer):
    if output == answer:
        print("OK")
        print(output)
    else:
        print("WA")
        print(f"Expected {answer}, got {output}")

        
tests = [('word',{'word': 1}),
        ('a word and another word',{'a': 1, 'word': 2, 'and': 1, 'another': 1}),
        ('punctuation? You! Ignore it: all^ of*((%@)) %#@%:it',
         {'punctuation': 1, 'you': 1, 'ignore': 1, 'it': 2, 'all': 1, 'of': 1}),
        ("Don't forget about apostrophes. They're tricky",
        {"don't": 1, 'forget': 1, 'about': 1, 'apostrophes': 1, "they're": 1, 'tricky': 1}),
        ("'apostrophes' are just regular words, like apostrophes.",
         {'apostrophes': 2, 'are': 1, 'just': 1, 'regular': 1, 'words': 1, 'like': 1}),
        ('be   extra !!! careful *** not to make words out of nothing',
         {'be': 1, 'extra': 1, 'careful': 1, 'not': 1, 'to': 1, 'make': 1, 'words': 1, 'out': 1, 'of': 1, 'nothing': 1}),
         ("''double'' quotation marks", {'double': 1, 'quotation': 1, 'marks': 1}),
         ("The_spacebar_is_broken,so.i'm_writing.Like.This",
          {'the': 1, 'spacebar': 1, 'is': 1, 'broken': 1, 'so': 1, "i'm": 1, 'writing': 1, 'like': 1, 'this': 1})
       ]

for test in tests:
    check(word_count(test[0]), test[1])