{
"metadata": {
"name": "",
"signature": "sha256:c40e50655840ba4cddbd867586d8ff72c0d9f62f75968a117063210b2769b62a"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# File Input and Output \n",
"\n",
"1. How are files stored? \n",
"2. Reading a text file, line-by-line \n",
" * File Object\n",
" * Lists from files\n",
" * Dictionaries from files\n",
"3. Exercise: Find the tallest tree\n",
"4. Writing a text file\n",
"5. Other Tools for Python File I/O \n",
"----------\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## first, make sure you are in the right directory\n",
"\n",
"you should see animals.txt and README.txt"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"pwd"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"ls"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"cat README.txt"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"cat animals.txt"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. How are files stored? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Binary Encoding of Text\n",
"\n",
"A simple plain text file that contains the plain-text string \"Rockefeller U.\" is stored in 15 bytes as: \n",
"\n",
"\n",
"![RU](images/01_Rockefeller.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some resources \n",
"\n",
"* Wikipedia: [ASCII](https://en.wikipedia.org/wiki/Ascii)\n",
"* [ The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Occasional Pain Point: `\\n` versus `\\r\\n` \n",
"\n",
"* `\\n` = New Line\n",
" * In most Unix systems including Mac OSX, `\\n` is the standard line terminator for text files \n",
" * The non-printable value, 0101 1100, will occupy the byte that signifies the new line marker\n",
"* `\\r` = Carriage Return\n",
" * Windows text files typically terminate lines with the sequence `\\r\\n`\n",
" * Whle Python does its best to insulate you from having to keep track of `\\n` versus `\\r\\n`, you may come across situations where the presence of Carriage Return characters is an issue.\n",
" \n",
" \n",
"### Where does \"Carriage Return\" come from anyway? \n",
"\n",
"![CRLF](images/02_CRLF.png)\n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Common Plain-Text file types\n",
"\n",
"* Text (.txt)\n",
"* Comma-Separated Values (.csv)\n",
" * Most spreadsheets allow for export of .csv files\n",
" * Common way of sharing \"Flat\" flies (one row = one observation)\n",
" * Many possible variations: headers, row names, separator characters, quotes, etc.\n",
"* Space-padded, tab-delimited (\\t), etc.\n",
"* Source code, HTML, \"Natural Language,\" XML, JSON, Yaml, etc. etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Files with Other Binary Encodings \n",
"\n",
"* Documents: databases (many open and proprietary formats), .pdf, .doc, .xls, etc.\n",
"* Images: .jpg, .png, many other formats\n",
"* Audio: mp3, ogg, etc.\n",
"* Data from sensors, repeated measures, archaic formats, etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Point: \n",
"\n",
"* Find out what format your input data is in\n",
"* To the extent possible, look at your data\n",
" * Linux utilities: `cat`, `less` `head`, `tail`, `wc`, etc.\n",
"* Test your expectations\n",
"* Decide how to handle missing data and \"bad\" data\n",
" * Big subjects, not covered today\n",
"* Allow plenty of time for dealing with file format ideosyncracies and conversions\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Reading a text file, line-by-line \n",
"\n",
"* Look at the data\n",
"* Read into a File Object\n",
"* Create a List from the File\n",
"* Create a Dictionary from File\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Animals dataset\n",
"\n",
"> A list of animals, their body mass, and their brain mass\n",
"\n",
"> Question of interest: Which animal has the largest brain-to-body mass ratio? And the smallest?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examine the data\n",
"\n",
"> Here we use the iPython Notebook \"magics\" to call the shell. You can also just do this from the command line\n",
"\n",
"* Look at the data\n",
"* Read the docs"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"%%bash\n",
"ls"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"%%bash\n",
"wc Animals.txt\n",
"# what does the output mean? call: man wc"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"%%bash\n",
"head Animals.txt"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read the data into a file object\n",
"\n",
"* two approaches"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# For convenience, assign the path and file name to variables\n",
"\n",
"read_this = 'Animals.txt'\n",
"print read_this"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# Approach 1: open, read, close\n",
"file_in = open(read_this, 'r')\n",
"all_lines = file_in.readlines()\n",
"file_in.close() # important to close the file"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"print \"Name of the file: \", file_in.name\n",
"print \"Is File Closed? : \", file_in.closed\n",
"print \"Opening mode : \", file_in.mode\n",
"print \"Softspace flag : \", file_in.softspace"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Another way to read a file: python `with()` \n",
"\n",
"* This formulation will automatically close the file after reading it"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"with open(read_this, 'r') as file_in:\n",
" lines = file_in.readlines()\n",
" # file_in.close()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"type(lines)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"lines[0:5]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# print each individual line -- iterate over the file\n",
"for waffles in lines:\n",
" print waffles"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# repr := representation of the line, revealing hidden characters\n",
"for line in lines:\n",
" print repr(line)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting Up Each Line: From a `String` to a `List`"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"some_line = lines[2]\n",
"print some_line\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"some_line[0]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"type(some_line)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# What is the first element of some_line?\n",
"some_line[0]\n",
"\n",
"# is that what you expected?"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"for _ in some_line:\n",
" print _"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"print type(some_line)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"line_split_at_commas = some_line.split(\",\") # forms a python List object, splitting at the comma characters\n",
"print line_split_at_commas"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"print some_line.strip() # gets rid of whitespace... \n",
"# print type(some_line.strip()) # though result is still a string"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"my_new_line = some_line.strip()\n",
"print my_new_line[0]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"line_list = some_line.strip().split(\",\")\n",
"print line_list"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"for l in line_list:\n",
" print l"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"3 + 3"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"'3' + '3'"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Multiple assignment trick: list unpacking"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"\n",
"animal, body_mass, brain_mass = line_list\n",
"\n",
"print \"Animal = \", animal\n",
"print \"Body Mass =\", float(body_mass)\n",
"print \"Brain Mass =\", float(brain_mass)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Exercise: Which animal has biggest brain-to-body ratio? Smallest?\n",
"\n",
"* Read Animal.txt into a file object\n",
"* Calculate brain / body ratio\n",
" * pay attention to integer arithmetic issues\n",
"* Print ratio for each animal\n",
"* Find min and max.\n",
"* Do your results make sense to you?\n",
"\n",
"* __Bonus 1__: \n",
" * Create variables to keep track of min and max\n",
" * Just print the animal with the max and the animal with the min\n",
"\n",
"* __Bonus 2__:\n",
" * Find average ratio\n"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## pseudo-code\n",
"\n",
"1. read the file into a LIST!!!!!!! here it happens to be called \"lines\"\n",
"2. for item in lines:\n",
"3. clean up the item. i,e., convert the item from a string into a NEW list\n",
"4. figure out which elements are the numbers you want\n",
"5. take the ratio \n",
"6. print the animal name and the ratio"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# List-based solution\n",
"with open(read_this, 'r') as file_in:\n",
" lines = file_in.readlines()\n",
" \n",
"ratios = []\n",
"for line in lines:\n",
" clean_line = line.strip().split(',')\n",
" animal, body_mass, brain_mass = clean_line\n",
" if animal == \"Animal\":\n",
" pass # skip the header line\n",
" else:\n",
" ratio = float(brain_mass) / (1000 * float(body_mass))\n",
" print \"Animal=\",animal, \"\\tratio=\", ratio\n",
" ratios.append([animal, ratio])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# dictionary-based solution\n",
"with open(read_this, 'r') as file_in:\n",
" lines = file_in.readlines()\n",
" \n",
"ratios_dict = {} # empty dictionary\n",
"for line in lines:\n",
" clean_line = line.strip().split(',')\n",
" animal, body_mass, brain_mass = clean_line\n",
" if animal == \"Animal\":\n",
" pass # skip the header line\n",
" else:\n",
" ratios_dict[animal] = float(brain_mass) / (1000 * (float(body_mass))) # WATCH parentheses and order of operations\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"print ratios_dict"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"print ratios_dict.keys()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"print ratios_dict.values()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"print (ratios_dict['Cat'])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"# using the dictionary like a database\n",
"my_study_subjects = ['Cat', 'Rabbit', 'Goat']\n",
"for subj in my_study_subjects:\n",
" print subj, (ratios_dict[subj])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Writing a text file\n",
"\n",
"* Suppose we want to save our calculated brain-to-body ratios"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"file_out_name = 'Animal_brain_to_body_ratio.txt'\n",
"file_out = open(file_out_name, 'w')\n",
"for r in ratios:\n",
" line_out = r[0] + \", \" + str(r[1]) + \"\\n\"\n",
" file_out.write(line_out)\n",
"file_out.close()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Other Tools for Reading and Writing Files \n",
"\n",
"## `csv` module\n",
"\n",
"* somewhat more \"helpful\" way of reading and writing flat files\n",
"* [official documentation](https://docs.python.org/2/library/csv.html)\n",
"\n",
"## pandas \n",
"\n",
"* Beyond file i/o, a powerful set of full-fledged \"Data Wrangling\" tools for python\n",
"* cross-sectional, time-series\n",
"* Intro video at \n",
"* Book info: \n",
"\n",
"## Serialization\n",
"\n",
"* Text files such as .txt and .csv are useful for recording \"flat\" files\n",
"* However sometimes we want to save more complex objects, like dictionaries\n",
"* This is referred to as `serialization`\n",
"* Many serialization tools in python, including \"[pickle](https://docs.python.org/2/library/pickle.html?highlight=pickle#pickle)\"\n",
"\n"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}