File Input and Output

  1. How are files stored?
  2. Reading a text file, line-by-line
    • File Object
    • Lists from files
    • Dictionaries from files
  3. Exercise: Which animal has the biggest brain-to-body ratio?
  4. Writing a text file
  5. Other Tools for Reading and Writing Files

First, make sure you are in the right directory.

You should see Animals.txt and README.txt.

In [1]:
pwd
Out[1]:
u'/Users/snorlax13mba/UNIVERSE/2_Projects_Live/Software Carpentry Rockefeller Bootcamp June 2014/Loops_Functions_and_Reading_Files/B_File_Io'
In [2]:
ls
Animals.txt    File_IO.ipynb  README.txt     data/          images/

In [3]:
cat README.txt
Animals {MASS}	R Documentation
Brain and Body Weights for 28 Species

Description

Average brain and body weights for 28 species of land animals.

Animals: string of animal name

body: body weight in kg.

brain:  brain weight in g.

Source

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p. 57.

References

Venables, W. N. and Ripley, B. D. (1999) Modern Applied Statistics with S-PLUS. Third Edition. Springer.


In [4]:
cat animals.txt
Animal, body,brain
Mountain beaver,1.35,8.1
Cow,465,423
Grey wolf,36.33,119.5
Goat,27.66,115
Guinea pig,1.04,5.5
Dipliodocus,11700,50
Asian elephant,2547,4603
Donkey,187.1,419
Horse,521,655
Potar monkey,10,115
Cat,3.3,25.6
Giraffe,529,680
Gorilla,207,406
Human,62,1320
African elephant,6654,5712
Triceratops,9400,70
Rhesus monkey,6.8,179
Kangaroo,35,56
Golden hamster,0.12,1
Mouse,0.023,0.4
Rabbit,2.5,12.1
Sheep,55.5,175
Jaguar,100,157
Chimpanzee,52.16,440
Rat,0.28,1.9
Brachiosaurus,87000,154.5
Mole,0.122,3
Pig,192,180

1. How are files stored?

Binary Encoding of Text

A simple plain-text file that contains the string "Rockefeller U." is stored in 14 bytes, one per character (15 if the file ends with a newline character), as:

[Figure: RU]
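
To make the byte-level storage concrete, the sketch below encodes the string as ASCII and prints each byte as eight bits. It assumes an ASCII encoding and no trailing newline; the variable names are illustrative only.

```python
# Sketch: view the bytes behind a plain-text string (ASCII encoding assumed).
text = "Rockefeller U."
encoded = bytearray(text.encode("ascii"))

# One byte per character in ASCII; a trailing newline would add one more byte.
print(len(encoded))

# Show each character next to its 8-bit pattern.
bit_patterns = [format(byte, "08b") for byte in encoded]
for char, bits in zip(text, bit_patterns):
    print(repr(char) + " " + bits)
```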

Occasional Pain Point: \n versus \r\n

  • \n = New Line
    • In most Unix systems, including Mac OS X, \n is the standard line terminator for text files
    • The non-printable value 0000 1010 (hexadecimal 0A) occupies the byte that marks the new line
  • \r = Carriage Return
    • Windows text files typically terminate lines with the sequence \r\n
    • While Python does its best to insulate you from having to keep track of \n versus \r\n, you may come across situations where the presence of Carriage Return characters is an issue.
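
One place the difference can show up is when a line is cleaned by hand. The sketch below uses two made-up strings, one with each line ending, and shows that stripping only "\n" leaves a stray "\r" behind.

```python
# Sketch: the same record as written on Unix/macOS and on Windows.
unix_line = "Cow,465,423\n"
windows_line = "Cow,465,423\r\n"

# Removing only "\n" leaves a stray carriage return on the Windows line.
print(repr(windows_line.rstrip("\n")))

# rstrip("\r\n") removes any trailing mix of "\r" and "\n" characters.
clean_unix = unix_line.rstrip("\r\n")
clean_windows = windows_line.rstrip("\r\n")
print(clean_unix == clean_windows)
```

Calling strip() with no arguments, as done later in this lesson, removes both characters as well, along with any other leading or trailing whitespace.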

Where does "Carriage Return" come from anyway?

[Figure: CRLF]

Common Plain-Text file types

  • Text (.txt)
  • Comma-Separated Values (.csv)
    • Most spreadsheets allow for export of .csv files
    • Common way of sharing "flat" files (one row = one observation)
    • Many possible variations: headers, row names, separator characters, quotes, etc.
  • Space-padded, tab-delimited, etc.
  • Source code, HTML, "Natural Language," XML, JSON, YAML, etc.

Files with Other Binary Encodings

  • Documents: databases (many open and proprietary formats), .pdf, .doc, .xls, etc.
  • Images: .jpg, .png, many other formats
  • Audio: mp3, ogg, etc.
  • Data from sensors, repeated measures, archaic formats, etc.

The Point:

  • Find out what format your input data is in
  • To the extent possible, look at your data
    • Linux utilities: cat, less, head, tail, wc, etc.
  • Test your expectations
  • Decide how to handle missing data and "bad" data
    • Big subjects, not covered today
  • Allow plenty of time for dealing with file format idiosyncrasies and conversions
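
"Test your expectations" can be as simple as a short check that runs before any analysis. The sketch below is one possible approach, not a complete validator: sample_lines and find_problems are hypothetical names, and two of the sample lines are deliberately broken.

```python
# Sketch: check that every data line has three fields and numeric masses.
# sample_lines is a hypothetical stand-in for lines read from a real file.
sample_lines = [
    "Animal, body,brain\n",
    "Cow,465,423\n",
    "Goat,27.66\n",          # deliberately missing a field
    "Cat,3.3,heavy\n",       # deliberately non-numeric
]

def find_problems(lines, n_fields=3):
    """Return (line_number, reason) pairs for lines that break expectations."""
    problems = []
    for number, line in enumerate(lines[1:], start=2):  # skip the header
        fields = line.strip().split(",")
        if len(fields) != n_fields:
            problems.append((number, "expected %d fields" % n_fields))
            continue
        try:
            float(fields[1])
            float(fields[2])
        except ValueError:
            problems.append((number, "non-numeric mass"))
    return problems

problems = find_problems(sample_lines)
print(problems)
```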

2. Reading a text file, line-by-line

  • Look at the data
  • Read into a File Object
  • Create a List from the File
  • Create a Dictionary from File

Animals dataset

A list of animals, their body mass, and their brain mass

Question of interest: Which animal has the largest brain-to-body mass ratio? And the smallest?

Examine the data

Here we use the IPython Notebook "magics" to call the shell. You can also do this from the command line.

  • Look at the data
  • Read the docs
In [5]:
%%bash
ls
Animals.txt
File_IO.ipynb
README.txt
data
images

In [6]:
%%bash
wc Animals.txt
# what does the output mean?  call: man wc
      29      38     522 Animals.txt
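
The three numbers reported by wc are the line count, the word count, and the byte count. As a rough cross-check, the sketch below computes the same three quantities in Python for a short in-memory sample (the first three lines of the file, not the whole file).

```python
# Sketch: count lines, words, and bytes the way wc does, for a small sample.
sample_text = "Animal, body,brain\nMountain beaver,1.35,8.1\nCow,465,423\n"

n_lines = sample_text.count("\n")            # wc counts newline characters
n_words = len(sample_text.split())           # words are separated by whitespace
n_bytes = len(sample_text.encode("utf-8"))   # bytes, not characters

print("%d %d %d" % (n_lines, n_words, n_bytes))
```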

In [7]:
%%bash
head Animals.txt
Animal, body,brain
Mountain beaver,1.35,8.1
Cow,465,423
Grey wolf,36.33,119.5
Goat,27.66,115
Guinea pig,1.04,5.5
Dipliodocus,11700,50
Asian elephant,2547,4603
Donkey,187.1,419
Horse,521,655

Read the data into a file object

  • two approaches
In [8]:
# For convenience, assign the path and file name to variables

read_this = 'Animals.txt'
print read_this
Animals.txt

In [9]:
# Approach 1: open, read, close
file_in = open(read_this, 'r')
all_lines = file_in.readlines()
file_in.close()   # important to close the file
In [10]:
print "Name of the file: ", file_in.name
print "Is File Closed? : ", file_in.closed
print "Opening mode : ", file_in.mode
print "Softspace flag : ", file_in.softspace
Name of the file:  Animals.txt
Is File Closed? :  True
Opening mode :  r
Softspace flag :  0

Another way to read a file: the Python with statement

  • This formulation will automatically close the file after reading it
In [11]:
with open(read_this, 'r') as file_in:
    lines = file_in.readlines()
    # no file_in.close() needed: the with block closes the file
In [12]:
type(lines)
Out[12]:
list
In [13]:
lines[0:5]
Out[13]:
['Animal, body,brain\n',
 'Mountain beaver,1.35,8.1\n',
 'Cow,465,423\n',
 'Grey wolf,36.33,119.5\n',
 'Goat,27.66,115\n']
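
readlines() loads every line into a list at once. For large files, a file object can also be looped over directly, yielding one line at a time. The sketch below writes a small temporary file (a hypothetical sample_animals.txt) so that it is self-contained, then reads it back line by line.

```python
import os
import tempfile

# Write a small hypothetical file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_animals.txt")
with open(sample_path, "w") as file_out:
    file_out.write("Animal, body,brain\nCow,465,423\nGoat,27.66,115\n")

# Iterating over the file object yields one line at a time,
# so the whole file never has to sit in memory as a list.
names = []
with open(sample_path, "r") as file_in:
    for line in file_in:
        names.append(line.split(",")[0])

print(names)
print(file_in.closed)   # the with block has already closed the file
```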
In [14]:
# print each line -- iterate over the list of lines
# (print adds its own newline after the "\n" already in each line, hence the blank lines)
for waffles in lines:
    print waffles
Animal, body,brain

Mountain beaver,1.35,8.1

Cow,465,423

Grey wolf,36.33,119.5

Goat,27.66,115

Guinea pig,1.04,5.5

Dipliodocus,11700,50

Asian elephant,2547,4603

Donkey,187.1,419

Horse,521,655

Potar monkey,10,115

Cat,3.3,25.6

Giraffe,529,680

Gorilla,207,406

Human,62,1320

African elephant,6654,5712

Triceratops,9400,70

Rhesus monkey,6.8,179

Kangaroo,35,56

Golden hamster,0.12,1

Mouse,0.023,0.4

Rabbit,2.5,12.1

Sheep,55.5,175

Jaguar,100,157

Chimpanzee,52.16,440

Rat,0.28,1.9

Brachiosaurus,87000,154.5

Mole,0.122,3

Pig,192,180


In [15]:
# repr := representation of the line, revealing hidden characters
for line in lines:
    print repr(line)
'Animal, body,brain\n'
'Mountain beaver,1.35,8.1\n'
'Cow,465,423\n'
'Grey wolf,36.33,119.5\n'
'Goat,27.66,115\n'
'Guinea pig,1.04,5.5\n'
'Dipliodocus,11700,50\n'
'Asian elephant,2547,4603\n'
'Donkey,187.1,419\n'
'Horse,521,655\n'
'Potar monkey,10,115\n'
'Cat,3.3,25.6\n'
'Giraffe,529,680\n'
'Gorilla,207,406\n'
'Human,62,1320\n'
'African elephant,6654,5712\n'
'Triceratops,9400,70\n'
'Rhesus monkey,6.8,179\n'
'Kangaroo,35,56\n'
'Golden hamster,0.12,1\n'
'Mouse,0.023,0.4\n'
'Rabbit,2.5,12.1\n'
'Sheep,55.5,175\n'
'Jaguar,100,157\n'
'Chimpanzee,52.16,440\n'
'Rat,0.28,1.9\n'
'Brachiosaurus,87000,154.5\n'
'Mole,0.122,3\n'
'Pig,192,180\n'

Splitting Up Each Line: From a String to a List

In [16]:
some_line = lines[2]
print some_line
Cow,465,423


In [17]:
some_line[0]
Out[17]:
'C'
In [18]:
type(some_line)
Out[18]:
str
In [19]:
# What is the first element of some_line?
some_line[0]

# is that what you expected?
Out[19]:
'C'
In [20]:
for _ in some_line:
    print _
C
o
w
,
4
6
5
,
4
2
3



In [21]:
print type(some_line)
<type 'str'>

In [22]:
line_split_at_commas = some_line.split(",")  # forms a python List object, splitting at the comma characters
print line_split_at_commas
['Cow', '465', '423\n']

In [23]:
print some_line.strip()        # removes leading and trailing whitespace, including the newline
# print type(some_line.strip())  # though the result is still a string
Cow,465,423

In [24]:
my_new_line = some_line.strip()
print my_new_line[0]
C

In [25]:
line_list = some_line.strip().split(",")
print line_list
['Cow', '465', '423']

In [26]:
for l in line_list:
    print l
Cow
465
423

In [27]:
3 + 3
Out[27]:
6
In [30]:
'3' + '3'
Out[30]:
'33'

Multiple assignment trick: list unpacking

In [31]:
animal, body_mass, brain_mass = line_list

print "Animal = ", animal
print "Body Mass =", float(body_mass)
print "Brain Mass =", float(brain_mass)
Animal =  Cow
Body Mass = 465.0
Brain Mass = 423.0
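
The strip, split, unpack, and convert steps shown so far can be collected into one small function. This is only a sketch: parse_line is a hypothetical name, and it assumes every line has exactly three comma-separated fields with numeric masses.

```python
def parse_line(line):
    """Turn 'Cow,465,423\\n' into ('Cow', 465.0, 423.0)."""
    # strip() removes the newline, split(",") makes a list of three strings,
    # and unpacking gives each string its own name.
    animal, body_mass, brain_mass = line.strip().split(",")
    # float() converts the number-like strings into real numbers.
    return animal, float(body_mass), float(brain_mass)

record = parse_line("Cow,465,423\n")
print(record)
```

A line with a different number of fields would raise a ValueError at the unpacking step, which is often a useful early warning about unexpected data.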

3. Exercise: Which animal has the biggest brain-to-body ratio? The smallest?

  • Read Animals.txt into a file object
  • Calculate brain / body ratio
    • pay attention to integer arithmetic issues
    • body is in kg and brain is in g, so convert to the same units first
  • Print ratio for each animal
  • Find min and max.
  • Do your results make sense to you?

  • Bonus 1:
    • Create variables to keep track of min and max
    • Just print the animal with the max and the animal with the min
  • Bonus 2:
    • Find average ratio

Pseudo-code

  1. Read the file into a list (here it happens to be called "lines")
  2. For each item in lines:
  3. Clean up the item, i.e., convert it from a string into a new list
  4. Figure out which elements are the numbers you want
  5. Take the ratio
  6. Print the animal name and the ratio
In [32]:
# List-based solution
with open(read_this, 'r') as file_in:
    lines = file_in.readlines()
    
ratios = []
for line in lines:
    clean_line = line.strip().split(',')
    animal, body_mass, brain_mass = clean_line
    if animal == "Animal":
        pass     # skip the header line
    else:
        ratio = float(brain_mass) / (1000 * float(body_mass))
        print "Animal=",animal, "\tratio=", ratio
        ratios.append([animal, ratio])
Animal= Mountain beaver 	ratio= 0.006
Animal= Cow 	ratio= 0.000909677419355
Animal= Grey wolf 	ratio= 0.00328929259565
Animal= Goat 	ratio= 0.00415762834418
Animal= Guinea pig 	ratio= 0.00528846153846
Animal= Dipliodocus 	ratio= 4.2735042735e-06
Animal= Asian elephant 	ratio= 0.00180722418532
Animal= Donkey 	ratio= 0.00223944414751
Animal= Horse 	ratio= 0.00125719769674
Animal= Potar monkey 	ratio= 0.0115
Animal= Cat 	ratio= 0.00775757575758
Animal= Giraffe 	ratio= 0.0012854442344
Animal= Gorilla 	ratio= 0.001961352657
Animal= Human 	ratio= 0.0212903225806
Animal= African elephant 	ratio= 0.000858431018936
Animal= Triceratops 	ratio= 7.44680851064e-06
Animal= Rhesus monkey 	ratio= 0.0263235294118
Animal= Kangaroo 	ratio= 0.0016
Animal= Golden hamster 	ratio= 0.00833333333333
Animal= Mouse 	ratio= 0.0173913043478
Animal= Rabbit 	ratio= 0.00484
Animal= Sheep 	ratio= 0.00315315315315
Animal= Jaguar 	ratio= 0.00157
Animal= Chimpanzee 	ratio= 0.00843558282209
Animal= Rat 	ratio= 0.00678571428571
Animal= Brachiosaurus 	ratio= 1.77586206897e-06
Animal= Mole 	ratio= 0.0245901639344
Animal= Pig 	ratio= 0.0009375
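
For the bonus questions, one possible approach is to let max and min compare the pairs by their ratio. The sketch below uses only four [animal, ratio] pairs copied from the output above, so its average is for that subset and not for the whole file.

```python
# Sketch for the bonus questions, using a few [animal, ratio] pairs
# copied from the output above (a subset, not the full list).
ratios = [
    ["Cow", 0.000909677419355],
    ["Human", 0.0212903225806],
    ["Rhesus monkey", 0.0263235294118],
    ["Brachiosaurus", 1.77586206897e-06],
]

# key= tells max/min to compare the ratio (item 1), not the name (item 0).
largest = max(ratios, key=lambda pair: pair[1])
smallest = min(ratios, key=lambda pair: pair[1])
average = sum(pair[1] for pair in ratios) / len(ratios)

print(largest[0])
print(smallest[0])
print(average)
```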

In [33]:
# dictionary-based solution
with open(read_this, 'r') as file_in:
    lines = file_in.readlines()
    
ratios_dict = {} # empty dictionary
for line in lines:
    clean_line = line.strip().split(',')
    animal, body_mass, brain_mass = clean_line
    if animal == "Animal":
        pass     # skip the header line
    else:
        ratios_dict[animal] = float(brain_mass) / (1000 * (float(body_mass)))  # WATCH parentheses and order of operations
In [34]:
print ratios_dict
{'Sheep': 0.003153153153153153, 'Horse': 0.0012571976967370442, 'Potar monkey': 0.0115, 'Goat': 0.0041576283441793205, 'African elephant': 0.0008584310189359783, 'Asian elephant': 0.0018072241853160581, 'Mountain beaver': 0.006, 'Kangaroo': 0.0016, 'Chimpanzee': 0.00843558282208589, 'Giraffe': 0.001285444234404537, 'Rabbit': 0.00484, 'Grey wolf': 0.003289292595650977, 'Jaguar': 0.00157, 'Cow': 0.0009096774193548387, 'Rhesus monkey': 0.026323529411764707, 'Cat': 0.007757575757575758, 'Gorilla': 0.0019613526570048307, 'Brachiosaurus': 1.7758620689655172e-06, 'Donkey': 0.002239444147514698, 'Golden hamster': 0.008333333333333333, 'Guinea pig': 0.005288461538461539, 'Triceratops': 7.446808510638298e-06, 'Dipliodocus': 4.273504273504274e-06, 'Pig': 0.0009375, 'Rat': 0.0067857142857142855, 'Human': 0.02129032258064516, 'Mouse': 0.017391304347826087, 'Mole': 0.02459016393442623}

In [35]:
print ratios_dict.keys()
['Sheep', 'Horse', 'Potar monkey', 'Goat', 'African elephant', 'Asian elephant', 'Mountain beaver', 'Kangaroo', 'Chimpanzee', 'Giraffe', 'Rabbit', 'Grey wolf', 'Jaguar', 'Cow', 'Rhesus monkey', 'Cat', 'Gorilla', 'Brachiosaurus', 'Donkey', 'Golden hamster', 'Guinea pig', 'Triceratops', 'Dipliodocus', 'Pig', 'Rat', 'Human', 'Mouse', 'Mole']

In [36]:
print ratios_dict.values()
[0.003153153153153153, 0.0012571976967370442, 0.0115, 0.0041576283441793205, 0.0008584310189359783, 0.0018072241853160581, 0.006, 0.0016, 0.00843558282208589, 0.001285444234404537, 0.00484, 0.003289292595650977, 0.00157, 0.0009096774193548387, 0.026323529411764707, 0.007757575757575758, 0.0019613526570048307, 1.7758620689655172e-06, 0.002239444147514698, 0.008333333333333333, 0.005288461538461539, 7.446808510638298e-06, 4.273504273504274e-06, 0.0009375, 0.0067857142857142855, 0.02129032258064516, 0.017391304347826087, 0.02459016393442623]

In [37]:
print (ratios_dict['Cat'])
0.00775757575758

In [38]:
# using the dictionary like a database
my_study_subjects = ['Cat', 'Rabbit', 'Goat']
for subj in my_study_subjects:
    print subj, (ratios_dict[subj])
Cat 0.00775757575758
Rabbit 0.00484
Goat 0.00415762834418
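
Looking up a name that is not in the dictionary raises a KeyError. The sketch below shows two common ways to guard against that, using a small hypothetical dictionary and a subject ("Unicorn") that is deliberately missing.

```python
# Sketch: look up animals safely in a small hypothetical ratios dictionary.
ratios_dict = {"Cat": 0.00775757575758, "Rabbit": 0.00484, "Goat": 0.00415762834418}

my_study_subjects = ["Cat", "Unicorn", "Goat"]
found = {}
for subj in my_study_subjects:
    if subj in ratios_dict:              # membership test avoids a KeyError
        found[subj] = ratios_dict[subj]
    else:
        print(subj + " is not in the dictionary")

print(sorted(found))
print(ratios_dict.get("Unicorn"))        # .get returns None for a missing key
```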


4. Writing a text file

  • Suppose we want to save our calculated brain-to-body ratios
In [172]:
file_out_name = 'Animal_brain_to_body_ratio.txt'
file_out = open(file_out_name, 'w')
for r in ratios:
    line_out = r[0] + ", " + str(r[1]) + "\n"
    file_out.write(line_out)
file_out.close()
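
The with statement works for writing as well as reading, and it closes the file even if an error occurs part-way through. The sketch below writes a short made-up list to a temporary location (so no real file is overwritten) and reads it back to confirm the contents; sample_ratios and out_path are hypothetical names.

```python
import os
import tempfile

# A short hypothetical list in the same [animal, ratio] shape as ratios.
sample_ratios = [["Cow", 0.0009], ["Human", 0.0213]]

# Write to a temporary directory so no real file is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "ratios_sample.txt")
with open(out_path, "w") as file_out:      # closed automatically on exit
    for animal, ratio in sample_ratios:
        file_out.write(animal + ", " + str(ratio) + "\n")

# Read the file back to confirm what was written.
with open(out_path, "r") as file_in:
    written = file_in.read()
print(repr(written))
```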

5. Other Tools for Reading and Writing Files

csv module
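
As a minimal sketch of what the standard-library csv module offers, the example below parses a few lines laid out like Animals.txt and computes the same brain-to-body ratio. It is written for Python 3 (io.StringIO holds the sample text in memory), and sample_text is a made-up excerpt, not the full file.

```python
import csv
import io

# A few lines in the same layout as Animals.txt, held in memory.
sample_text = "Animal, body,brain\nCow,465,423\nGoat,27.66,115\n"

# DictReader uses the header row as keys for every following row.
# skipinitialspace=True drops the blank after the comma in ", body".
reader = csv.DictReader(io.StringIO(sample_text), skipinitialspace=True)
rows = list(reader)

for row in rows:
    # brain is in g and body is in kg, hence the factor of 1000.
    ratio = float(row["brain"]) / (1000 * float(row["body"]))
    print(row["Animal"] + " " + str(ratio))
```

Compared with splitting on commas by hand, the csv module also copes with quoted fields that themselves contain commas.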

pandas

Serialization

  • Text formats such as .txt and .csv are useful for storing "flat" data
  • However, sometimes we want to save more complex objects, like dictionaries
  • This is referred to as serialization
  • Python offers many serialization tools, including "pickle"
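
A minimal sketch of a pickle round trip is shown below, using a small hypothetical dictionary. It serializes to bytes in memory; writing to disk would use pickle.dump with a file opened in binary mode.

```python
import pickle

# A small hypothetical dictionary standing in for ratios_dict.
ratios_sample = {"Cat": 0.00775757575758, "Rabbit": 0.00484}

# dumps() serializes the object to bytes; dump() would write to an open
# binary file instead.
payload = pickle.dumps(ratios_sample)

# loads() rebuilds an equal dictionary from those bytes.
restored = pickle.loads(payload)
print(restored == ratios_sample)
```

Pickle files are specific to Python and should only be loaded from sources you trust, since unpickling can execute arbitrary code.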