File Input and Output

How are files stored?
Reading a text file, line-by-line
- File Object
- Lists from files
- Dictionaries from files
Exercise: Find the tallest tree
Writing a text file
Other Tools for Python File I/O

First, make sure you are in the right directory.

You should see Animals.txt and README.txt.

pwd

u'/Users/snorlax13mba/UNIVERSE/2_Projects_Live/Software Carpentry Rockefeller Bootcamp June 2014/Loops_Functions_and_Reading_Files/B_File_Io'

ls

Animals.txt    File_IO.ipynb  README.txt     data/          images/

cat README.txt

Animals {MASS}	R Documentation
Brain and Body Weights for 28 Species

Description

Average brain and body weights for 28 species of land animals.

Animals: string of animal name

body: body weight in kg.

brain:  brain weight in g.

Source

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p. 57.

References

Venables, W. N. and Ripley, B. D. (1999) Modern Applied Statistics with S-PLUS. Third Edition. Springer.

cat Animals.txt

Animal, body,brain
Mountain beaver,1.35,8.1
Cow,465,423
Grey wolf,36.33,119.5
Goat,27.66,115
Guinea pig,1.04,5.5
Dipliodocus,11700,50
Asian elephant,2547,4603
Donkey,187.1,419
Horse,521,655
Potar monkey,10,115
Cat,3.3,25.6
Giraffe,529,680
Gorilla,207,406
Human,62,1320
African elephant,6654,5712
Triceratops,9400,70
Rhesus monkey,6.8,179
Kangaroo,35,56
Golden hamster,0.12,1
Mouse,0.023,0.4
Rabbit,2.5,12.1
Sheep,55.5,175
Jaguar,100,157
Chimpanzee,52.16,440
Rat,0.28,1.9
Brachiosaurus,87000,154.5
Mole,0.122,3
Pig,192,180

1. How are files stored?

Binary Encoding of Text

A plain-text file that contains the string "Rockefeller U." followed by a newline is stored in 15 bytes, one byte per ASCII character (14 for the string, 1 for the newline):

RU
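To inspect the bytes yourself, here is a minimal sketch. It assumes ASCII text that ends with a single `\n`; the variable names are illustrative, not from the original notebook.

```python
# One byte per character, assuming ASCII text that ends with a newline.
text = "Rockefeller U.\n"
data = bytearray(text.encode("ascii"))   # bytearray yields integers in Python 2 and 3

print(len(data))                         # number of bytes the text occupies

# Pair each character with its 8-bit pattern.
bit_patterns = [format(byte, "08b") for byte in data]
for char, bits in zip(text, bit_patterns):
    print(repr(char) + " " + bits)
```

The last pattern printed belongs to the newline character, which is stored like any other byte.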

Some resources

Occasional Pain Point: `\n` versus `\r\n`

\n = New Line
- On most Unix systems, including Mac OS X, \n is the standard line terminator for text files
- The non-printable value 0000 1010 (decimal 10, hex 0A) occupies the byte that marks the new line
\r = Carriage Return
- Windows text files typically terminate lines with the sequence \r\n
- While Python does its best to insulate you from having to keep track of \n versus \r\n, you may come across situations where the presence of Carriage Return characters is an issue.
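A small sketch of how a stray carriage return can show up, and one way to remove it. The two sample strings are made up to mimic a Windows-style and a Unix-style line.

```python
# The same record as it might appear in a Windows file and in a Unix file.
windows_line = "Cow,465,423\r\n"
unix_line = "Cow,465,423\n"

# Stripping only "\n" leaves a carriage return behind.
leftover = windows_line.rstrip("\n")

# Stripping both characters gives the same clean text for either file type.
clean_windows = windows_line.rstrip("\r\n")
clean_unix = unix_line.rstrip("\r\n")

print(repr(leftover))
print(repr(clean_windows))
print(clean_windows == clean_unix)
```

Printing with repr() is what makes the leftover `\r` visible; a plain print would hide it.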

Where does "Carriage Return" come from anyway?

CRLF

Common Plain-Text file types

Text (.txt)
Comma-Separated Values (.csv)
- Most spreadsheets allow for export of .csv files
- Common way of sharing "flat" files (one row = one observation)
- Many possible variations: headers, row names, separator characters, quotes, etc.
Space-padded, tab-delimited, etc.
Source code, HTML, "Natural Language," XML, JSON, Yaml, etc. etc.

Files with Other Binary Encodings

Documents: databases (many open and proprietary formats), .pdf, .doc, .xls, etc.
Images: .jpg, .png, many other formats
Audio: mp3, ogg, etc.
Data from sensors, repeated measures, archaic formats, etc.

The Point:

Find out what format your input data is in
To the extent possible, look at your data
- Linux utilities: cat, less, head, tail, wc, etc.
Test your expectations
Decide how to handle missing data and "bad" data
- Big subjects, not covered today
Allow plenty of time for dealing with file format idiosyncrasies and conversions
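"Test your expectations" can be as simple as checking that every line has the shape you think it has. The sketch below assumes a three-column, comma-separated layout like the Animals file; the two bad rows and the helper name `is_valid` are invented for illustration.

```python
# A few sample rows, including two deliberately bad ones (invented for illustration).
sample_lines = [
    "Cow,465,423\n",
    "Goat,27.66\n",          # missing a field
    "Horse,521,unknown\n",   # non-numeric value
    "Cat,3.3,25.6\n",
]

def is_valid(line):
    """Return True if the line has a name followed by two numeric fields."""
    fields = line.strip().split(",")
    if len(fields) != 3:
        return False
    try:
        float(fields[1])
        float(fields[2])
    except ValueError:
        return False
    return True

# Collect the lines that do not match the expected layout.
bad_lines = [line for line in sample_lines if not is_valid(line)]
print(bad_lines)
```

What to do with the rejected lines (skip, repair, or stop) is a separate decision that depends on the data.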

2. Reading a text file, line-by-line

Look at the data
Read into a File Object
Create a List from the File
Create a Dictionary from File

Animals dataset

A list of animals, their body mass, and their brain mass

Question of interest: Which animal has the largest brain-to-body mass ratio? And the smallest?

Examine the data

Here we use the IPython Notebook "magics" to call the shell. You can also do this directly from the command line.

Look at the data
Read the docs

%%bash
ls

Animals.txt
File_IO.ipynb
README.txt
data
images

%%bash
wc Animals.txt
# what does the output mean?  call: man wc

      29      38     522 Animals.txt
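The three numbers that wc prints are the line count, the word count, and the byte count. A rough Python equivalent, sketched on a short sample string rather than the real file:

```python
# First three lines of a file shaped like Animals.txt, held in a string.
sample_text = "Animal, body,brain\nMountain beaver,1.35,8.1\nCow,465,423\n"

line_count = sample_text.count("\n")            # wc counts newline characters
word_count = len(sample_text.split())           # words are runs of non-whitespace
byte_count = len(sample_text.encode("utf-8"))   # bytes, not characters

print(line_count)
print(word_count)
print(byte_count)
```

Note that "Mountain beaver,1.35,8.1" counts as two words, because only whitespace separates words; commas do not.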

%%bash
head Animals.txt

Animal, body,brain
Mountain beaver,1.35,8.1
Cow,465,423
Grey wolf,36.33,119.5
Goat,27.66,115
Guinea pig,1.04,5.5
Dipliodocus,11700,50
Asian elephant,2547,4603
Donkey,187.1,419
Horse,521,655

Read the data into a file object

Two approaches:

# For convenience, assign the path and file name to variables

read_this = 'Animals.txt'
print read_this

Animals.txt

# Approach 1: open, read, close
file_in = open(read_this, 'r')
all_lines = file_in.readlines()
file_in.close()   # important to close the file

print "Name of the file: ", file_in.name
print "Is File Closed? : ", file_in.closed
print "Opening mode : ", file_in.mode
print "Softspace flag : ", file_in.softspace

Name of the file:  Animals.txt
Is File Closed? :  True
Opening mode :  r
Softspace flag :  0

Another way to read a file: the Python `with` statement

This formulation automatically closes the file when the indented block ends, even if an error occurs inside it

with open(read_this, 'r') as file_in:
    lines = file_in.readlines()
    # no file_in.close() needed: the with block closes the file

type(lines)

list

lines[0:5]

['Animal, body,brain\n',
 'Mountain beaver,1.35,8.1\n',
 'Cow,465,423\n',
 'Grey wolf,36.33,119.5\n',
 'Goat,27.66,115\n']
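To confirm that the `with` statement really closes the file, one can check the `closed` attribute inside and outside the block. This sketch writes its own throwaway file (the name `sample.txt` and its contents are made up) so that it does not depend on Animals.txt.

```python
import os
import tempfile

# Create a small throwaway file to read back.
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(tmp_path, "w") as file_out:
    file_out.write("Cow,465,423\nGoat,27.66,115\n")

with open(tmp_path, "r") as file_in:
    inside = file_in.closed        # still open inside the block
    sample = file_in.readlines()
outside = file_in.closed           # closed as soon as the block ends

print(inside)
print(outside)
print(sample)
```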

# print each individual line -- iterate over the file
for waffles in lines:
    print waffles

Animal, body,brain

Mountain beaver,1.35,8.1

Cow,465,423

Grey wolf,36.33,119.5

Goat,27.66,115

Guinea pig,1.04,5.5

Dipliodocus,11700,50

Asian elephant,2547,4603

Donkey,187.1,419

Horse,521,655

Potar monkey,10,115

Cat,3.3,25.6

Giraffe,529,680

Gorilla,207,406

Human,62,1320

African elephant,6654,5712

Triceratops,9400,70

Rhesus monkey,6.8,179

Kangaroo,35,56

Golden hamster,0.12,1

Mouse,0.023,0.4

Rabbit,2.5,12.1

Sheep,55.5,175

Jaguar,100,157

Chimpanzee,52.16,440

Rat,0.28,1.9

Brachiosaurus,87000,154.5

Mole,0.122,3

Pig,192,180
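The blank lines in this output appear because each line already ends in `\n` and print adds a newline of its own. A minimal sketch of one way to avoid the double spacing, using two sample lines:

```python
sample_lines = ["Cow,465,423\n", "Goat,27.66,115\n"]

# Drop the newline that came from the file; print supplies its own.
for line in sample_lines:
    print(line.rstrip("\n"))

# The same idea as a list, handy if the cleaned lines are needed later.
stripped = [line.rstrip("\n") for line in sample_lines]
```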

# repr := representation of the line, revealing hidden characters
for line in lines:
    print repr(line)

'Animal, body,brain\n'
'Mountain beaver,1.35,8.1\n'
'Cow,465,423\n'
'Grey wolf,36.33,119.5\n'
'Goat,27.66,115\n'
'Guinea pig,1.04,5.5\n'
'Dipliodocus,11700,50\n'
'Asian elephant,2547,4603\n'
'Donkey,187.1,419\n'
'Horse,521,655\n'
'Potar monkey,10,115\n'
'Cat,3.3,25.6\n'
'Giraffe,529,680\n'
'Gorilla,207,406\n'
'Human,62,1320\n'
'African elephant,6654,5712\n'
'Triceratops,9400,70\n'
'Rhesus monkey,6.8,179\n'
'Kangaroo,35,56\n'
'Golden hamster,0.12,1\n'
'Mouse,0.023,0.4\n'
'Rabbit,2.5,12.1\n'
'Sheep,55.5,175\n'
'Jaguar,100,157\n'
'Chimpanzee,52.16,440\n'
'Rat,0.28,1.9\n'
'Brachiosaurus,87000,154.5\n'
'Mole,0.122,3\n'
'Pig,192,180\n'

Splitting Up Each Line: From a `String` to a `List`

some_line = lines[2]
print some_line

Cow,465,423

some_line[0]

'C'

type(some_line)

str

# What is the first element of some_line?
some_line[0]

# is that what you expected?

'C'

for _ in some_line:
    print _

C
o
w
,
4
6
5
,
4
2
3

print type(some_line)

<type 'str'>

line_split_at_commas = some_line.split(",")  # forms a python List object, splitting at the comma characters
print line_split_at_commas

['Cow', '465', '423\n']

print some_line.strip()        # removes leading and trailing whitespace, including the newline
# print type(some_line.strip())  # though result is still a string

Cow,465,423

my_new_line = some_line.strip()
print my_new_line[0]

C

line_list = some_line.strip().split(",")
print line_list

['Cow', '465', '423']

for l in line_list:
    print l

Cow
465
423
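The strip-then-split steps can be wrapped in a small helper. `parse_line` is a hypothetical name, not part of the original notebook; it also strips each field, which handles the space before "body" in the header line.

```python
def parse_line(line):
    """Split one comma-separated line into a list of cleaned-up fields."""
    return [field.strip() for field in line.strip().split(",")]

header = parse_line("Animal, body,brain\n")
record = parse_line("Cow,465,423\n")

print(header)
print(record)
```

Every field is still a string at this point; converting to numbers is a separate step.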

3 + 3

6

'3' + '3'

'33'
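The same `+` operator adds numbers but concatenates strings, so values read from a file have to be converted before doing arithmetic. A short sketch:

```python
as_text = '3' + '3'                     # string concatenation
as_floats = float('3') + float('3')     # numeric addition after conversion
as_ints = int('3') + int('3')

print(repr(as_text))
print(as_floats)
print(as_ints)
```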

Multiple assignment trick: list unpacking

animal, body_mass, brain_mass = line_list

print "Animal = ", animal
print "Body Mass =", float(body_mass)
print "Brain Mass =", float(brain_mass)

Animal =  Cow
Body Mass = 465.0
Brain Mass = 423.0
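Unpacking only works when the number of names matches the number of items. The sketch below converts the numeric fields right after unpacking, then shows what happens with a line that is one field short (`short_list` is an invented example).

```python
line_list = ['Cow', '465', '423']

# Unpack, then convert the two numeric fields from strings to floats.
animal, body_mass, brain_mass = line_list
body_mass, brain_mass = float(body_mass), float(brain_mass)

# A list with the wrong number of items raises ValueError on unpacking.
short_list = ['Goat', '27.66']
try:
    animal2, body2, brain2 = short_list
    unpacked = True
except ValueError:
    unpacked = False

print(animal + " " + str(body_mass) + " " + str(brain_mass))
print(unpacked)
```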

3. Exercise: Which animal has biggest brain-to-body ratio? Smallest?

Read Animals.txt into a file object
Calculate brain / body ratio
- pay attention to integer arithmetic: in Python 2, dividing one integer by another discards the fractional part
Print ratio for each animal
Find min and max.
Do your results make sense to you?
Bonus 1:
- Create variables to keep track of min and max
- Just print the animal with the max and the animal with the min
Bonus 2:
- Find average ratio

pseudo-code

read the file into a list (here it happens to be called "lines")
for item in lines:
    clean up the item, i.e., convert the item from a string into a NEW list
    figure out which elements are the numbers you want
    take the ratio
    print the animal name and the ratio

# List-based solution
with open(read_this, 'r') as file_in:
    lines = file_in.readlines()
    
ratios = []
for line in lines:
    clean_line = line.strip().split(',')
    animal, body_mass, brain_mass = clean_line
    if animal == "Animal":
        pass     # skip the header line
    else:
        ratio = float(brain_mass) / (1000 * float(body_mass))
        print "Animal=",animal, "\tratio=", ratio
        ratios.append([animal, ratio])

Animal= Mountain beaver 	ratio= 0.006
Animal= Cow 	ratio= 0.000909677419355
Animal= Grey wolf 	ratio= 0.00328929259565
Animal= Goat 	ratio= 0.00415762834418
Animal= Guinea pig 	ratio= 0.00528846153846
Animal= Dipliodocus 	ratio= 4.2735042735e-06
Animal= Asian elephant 	ratio= 0.00180722418532
Animal= Donkey 	ratio= 0.00223944414751
Animal= Horse 	ratio= 0.00125719769674
Animal= Potar monkey 	ratio= 0.0115
Animal= Cat 	ratio= 0.00775757575758
Animal= Giraffe 	ratio= 0.0012854442344
Animal= Gorilla 	ratio= 0.001961352657
Animal= Human 	ratio= 0.0212903225806
Animal= African elephant 	ratio= 0.000858431018936
Animal= Triceratops 	ratio= 7.44680851064e-06
Animal= Rhesus monkey 	ratio= 0.0263235294118
Animal= Kangaroo 	ratio= 0.0016
Animal= Golden hamster 	ratio= 0.00833333333333
Animal= Mouse 	ratio= 0.0173913043478
Animal= Rabbit 	ratio= 0.00484
Animal= Sheep 	ratio= 0.00315315315315
Animal= Jaguar 	ratio= 0.00157
Animal= Chimpanzee 	ratio= 0.00843558282209
Animal= Rat 	ratio= 0.00678571428571
Animal= Brachiosaurus 	ratio= 1.77586206897e-06
Animal= Mole 	ratio= 0.0245901639344
Animal= Pig 	ratio= 0.0009375
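For the bonus questions, one possible approach is to keep running track of the minimum, maximum, and total while looping. The sketch below uses only five animals from the dataset so that it stays self-contained; on the full list the same loop applies unchanged.

```python
# [animal, ratio] pairs for a five-animal subset, computed as brain_g / (1000 * body_kg).
ratios = [
    ["Cow", 423 / (1000 * 465.0)],
    ["Human", 1320 / (1000 * 62.0)],
    ["Rhesus monkey", 179 / (1000 * 6.8)],
    ["Brachiosaurus", 154.5 / (1000 * 87000.0)],
    ["Mouse", 0.4 / (1000 * 0.023)],
]

# Start with the first entry, then update whenever a more extreme ratio appears.
max_animal, max_ratio = ratios[0]
min_animal, min_ratio = ratios[0]
total = 0.0
for animal, ratio in ratios:
    total += ratio
    if ratio > max_ratio:
        max_animal, max_ratio = animal, ratio
    if ratio < min_ratio:
        min_animal, min_ratio = animal, ratio

average = total / len(ratios)

print("max: " + max_animal)
print("min: " + min_animal)
print("average: " + str(average))
```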

# dictionary-based solution
with open(read_this, 'r') as file_in:
    lines = file_in.readlines()
    
ratios_dict = {} # empty dictionary
for line in lines:
    clean_line = line.strip().split(',')
    animal, body_mass, brain_mass = clean_line
    if animal == "Animal":
        pass     # skip the header line
    else:
        ratios_dict[animal] = float(brain_mass) / (1000 * (float(body_mass)))  # WATCH parentheses and order of operations

print ratios_dict

{'Sheep': 0.003153153153153153, 'Horse': 0.0012571976967370442, 'Potar monkey': 0.0115, 'Goat': 0.0041576283441793205, 'African elephant': 0.0008584310189359783, 'Asian elephant': 0.0018072241853160581, 'Mountain beaver': 0.006, 'Kangaroo': 0.0016, 'Chimpanzee': 0.00843558282208589, 'Giraffe': 0.001285444234404537, 'Rabbit': 0.00484, 'Grey wolf': 0.003289292595650977, 'Jaguar': 0.00157, 'Cow': 0.0009096774193548387, 'Rhesus monkey': 0.026323529411764707, 'Cat': 0.007757575757575758, 'Gorilla': 0.0019613526570048307, 'Brachiosaurus': 1.7758620689655172e-06, 'Donkey': 0.002239444147514698, 'Golden hamster': 0.008333333333333333, 'Guinea pig': 0.005288461538461539, 'Triceratops': 7.446808510638298e-06, 'Dipliodocus': 4.273504273504274e-06, 'Pig': 0.0009375, 'Rat': 0.0067857142857142855, 'Human': 0.02129032258064516, 'Mouse': 0.017391304347826087, 'Mole': 0.02459016393442623}

print ratios_dict.keys()

['Sheep', 'Horse', 'Potar monkey', 'Goat', 'African elephant', 'Asian elephant', 'Mountain beaver', 'Kangaroo', 'Chimpanzee', 'Giraffe', 'Rabbit', 'Grey wolf', 'Jaguar', 'Cow', 'Rhesus monkey', 'Cat', 'Gorilla', 'Brachiosaurus', 'Donkey', 'Golden hamster', 'Guinea pig', 'Triceratops', 'Dipliodocus', 'Pig', 'Rat', 'Human', 'Mouse', 'Mole']

print ratios_dict.values()

[0.003153153153153153, 0.0012571976967370442, 0.0115, 0.0041576283441793205, 0.0008584310189359783, 0.0018072241853160581, 0.006, 0.0016, 0.00843558282208589, 0.001285444234404537, 0.00484, 0.003289292595650977, 0.00157, 0.0009096774193548387, 0.026323529411764707, 0.007757575757575758, 0.0019613526570048307, 1.7758620689655172e-06, 0.002239444147514698, 0.008333333333333333, 0.005288461538461539, 7.446808510638298e-06, 4.273504273504274e-06, 0.0009375, 0.0067857142857142855, 0.02129032258064516, 0.017391304347826087, 0.02459016393442623]

print (ratios_dict['Cat'])

0.00775757575758

# using the dictionary like a database
my_study_subjects = ['Cat', 'Rabbit', 'Goat']
for subj in my_study_subjects:
    print subj, (ratios_dict[subj])

Cat 0.00775757575758
Rabbit 0.00484
Goat 0.00415762834418
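A few more dictionary lookups that may be useful here, sketched on a three-entry dictionary with values copied from the output above. 'Unicorn' is a made-up key used to show what happens when a subject is missing.

```python
ratios_dict = {'Cat': 0.00775757575758, 'Rabbit': 0.00484, 'Goat': 0.00415762834418}

# .get returns None (or a default you supply) instead of raising KeyError.
missing = ratios_dict.get('Unicorn')
has_goat = 'Goat' in ratios_dict

# max and min over the keys, ranked by their associated values.
largest = max(ratios_dict, key=ratios_dict.get)
smallest = min(ratios_dict, key=ratios_dict.get)

print(missing)
print(has_goat)
print(largest)
print(smallest)
```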

4. Writing a text file

Suppose we want to save our calculated brain-to-body ratios

file_out_name = 'Animal_brain_to_body_ratio.txt'
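file_out = open(file_out_name, 'w')   # 'w' creates the file, or overwrites it if it already exists
for r in ratios:
    # r is [animal, ratio]; write() does not add a newline, so append one
    line_out = r[0] + ", " + str(r[1]) + "\n"
    file_out.write(line_out)
file_out.close()   # closing flushes the data to disk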
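One way to check what was written is to read the file straight back. This sketch writes to a temporary location with made-up, rounded values, so it does not touch the real output file; the file name `ratios_sample.txt` is illustrative.

```python
import os
import tempfile

ratios = [["Cow", 0.0009], ["Human", 0.0213]]   # illustrative, rounded values

out_path = os.path.join(tempfile.mkdtemp(), "ratios_sample.txt")

# Write one "name, ratio" line per animal.
with open(out_path, "w") as file_out:
    for animal, ratio in ratios:
        file_out.write(animal + ", " + str(ratio) + "\n")

# Read the file back to confirm its contents.
with open(out_path, "r") as file_in:
    written = file_in.readlines()

print(written)
```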

5. Other Tools for Reading and Writing Files

`csv` module

somewhat more "helpful" way of reading and writing flat files
official documentation
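A minimal sketch of the csv module on a few sample lines. `csv.reader` accepts any iterable of lines, so an open file object can be passed in place of the list used here; `skipinitialspace=True` drops the space that follows a comma, as in the header of the Animals file.

```python
import csv

sample_lines = ["Animal, body,brain", "Cow,465,423", "Grey wolf,36.33,119.5"]

# Each row comes back as a list of strings, already split at the commas.
reader = csv.reader(sample_lines, skipinitialspace=True)
rows = list(reader)
header, records = rows[0], rows[1:]

print(header)
print(records)
```

The values are still strings, so numeric columns need float() just as before.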

pandas

Beyond file i/o, a powerful set of full-fledged "Data Wrangling" tools for python
cross-sectional, time-series
Intro video at http://vimeo.com/59324550
Book info: http://shop.oreilly.com/product/0636920023784.do

Serialization

Text files such as .txt and .csv are useful for recording "flat" data
However, sometimes we want to save more complex objects, like dictionaries
Converting such objects into a form that can be written to a file is referred to as serialization
Python has many serialization tools, including "pickle"
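A short sketch of a pickle round trip, using a two-entry dictionary with values copied from the earlier output. `pickle.dumps` and `pickle.loads` work on in-memory bytes; `pickle.dump` and `pickle.load` do the same with a file opened in binary mode.

```python
import pickle

ratios_dict = {'Cat': 0.00775757575758, 'Rabbit': 0.00484}

serialized = pickle.dumps(ratios_dict)   # dictionary -> bytes
restored = pickle.loads(serialized)      # bytes -> a new, equal dictionary

print(restored == ratios_dict)
print(restored is ratios_dict)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.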
