26. Reading and writing files

In progress.

In this section, we look at some file I/O: file input and output.

One common use of programming is to read in data and then perform some kind of analysis and/or visualization with it. Getting the data from disk into the program is called reading in a file. The details of doing this depend on the format in which the data are stored in the first place, but we will take a look at some common, simple storage types, as well as principles that can be applied in more complicated cases.

Complementarily, we might want to output and store data that we have created, cleaned or otherwise processed. We have seen a bit about making plots to display, but we might also want to save the data in a manner that it could be read in later and further processed. This is called writing out a file, and again there are many ways to do this, using formats such as: data tables, comma-separated values (CSV), tab-separated values (TSV), binary (space-efficient, but not directly readable as a simple text file), and more. We look at a couple of simple cases here.

A general note will be that the more you know about the data, the better off you will be whether reading or writing a file. File extensions (the last few letters of a typical file name after a '.', such as "pdf", "txt", "py", "doc" or more) should describe the file format, but sometimes extra checks are necessary. When dealing with real data, too, there can be mistakes and other practical considerations, like having missing data. So, please remember to always check your results.

The following modules are used below:

import numpy as np
import copy, sys

26.1. Writing files

Consider the case of having a 2D array of data to store in a file; here, we will start with a small dataset of a 5×2 array of floats:

data = np.array([[43.2, 4], [42.7, 21.5], [7.2, 30], [30, 16], [102, 38.1]])

There are many different styles, or formats, we could use to write this data to a file. We have to decide: whether to write as human-readable ASCII characters or as binary; how to separate the values (spaces, tabs, commas, etc.); whether to include comments, such as where the data came from; whether to include column labels/headers; and more. We will discuss some of these points below, but we will start by outputting the data as ASCII characters, for easier visual verification.

26.1.1. Basic Formatting

Perhaps the most straightforward way to think of writing this to a file would be to:

  • first plan how we would want to display the values using the standard print function,

  • then just switch the same segment of code to writing to a file instead of displaying to the notebook or terminal.

The above paradigm sounds reasonable; we just have one additional note. The print function adds a \n character at the end of every call, by having a default kwarg end='\n' (see the function's help). However, the file-writing function that we will use does not add this, so we will provide the following kwarg to our print calls to better emulate it, via the empty string: end=''.
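As a small illustration of this difference, the sketch below sends output to an in-memory text stream instead of the screen, so that the exact characters can be inspected afterwards. The use of io.StringIO and the variable names buf and captured are assumptions made just for this demonstration; they are not part of the workflow in this section.

```python
import io

buf = io.StringIO()                 # in-memory text stream standing in for a file

print("a", file=buf)                # default end='\n': a newline is appended
print("b", end='', file=buf)        # end='': nothing is appended
buf.write("c")                      # write() does not append anything itself

captured = buf.getvalue()
print(repr(captured))               # repr shows any newline characters explicitly
```

Only the first call contributes a newline, which is why print with end='' behaves like a direct write.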

While we can use print(data) to show all the values, that would also include the square brackets and other non-numerical information, which we probably do not want. Instead, we can loop over the values and print them, while also taking advantage of string formatting. We also have to decide on a delimiter, which is the character(s) used to separate individual values or fields. Common choices are whitespace (and sometimes even more specifically, tabs) or a comma. Here, we start with (non-tab) whitespace.

The data are floating point values, with a maximum precision of 1 decimal place. So, we start with the following string formatting for each value:

nrow, ncol = np.shape(data)

for i in range(nrow):
    for j in range(ncol):
        print("{:6.1f}".format(data[i, j]), end='')

... which produces:

   43.2   4.0  42.7  21.5   7.2  30.0  30.0  16.0 102.0  38.1

Well, all the numbers are written out, and we could write to file like this and just instruct someone to read this in as a 5×2 array (that kind of "extra" information to read the data properly is called metadata). However, let's see if we can reflect the shape of the data in the output. How can we control that? Well, by making use of our friendly newline character.

We want to display the data as 5 rows of 2 columns, so after printing each row we would like a newline. Thus we could put one after looping through the contents of a given row:

for i in range(nrow):
    for j in range(ncol):
        print("{:6.1f}".format(data[i, j]), end='')
    print("\n", end='')

... which produces:

  43.2   4.0
  42.7  21.5
   7.2  30.0
  30.0  16.0
 102.0  38.1

That seems a reasonable way to represent the data! Note that the vertical alignment is aesthetically pleasing and useful for humans checking the data, but it is the whitespace separation that is important in determining the separation of values and structure. The vertical column alignment could be made uneven, and the file could still be read in properly as a whitespace-delimited file.
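To see why the alignment is only cosmetic, here is a minimal sketch comparing an aligned line with an unevenly spaced one. It uses the string method split(), which divides a string at any run of whitespace; the variable names are chosen just for this illustration.

```python
aligned = "  43.2   4.0"            # padded to fixed-width columns
uneven  = "43.2 4.0"                # same values, minimal spacing

print(aligned.split())
print(uneven.split())

# Both lines separate into the same two fields
same_fields = aligned.split() == uneven.split()
print(same_fields)
```

Both lines yield the same list of two strings, so a whitespace-delimited reader would treat them identically.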

26.1.2. Basic file writing: Whitespace separation

To write the data to a file, we follow a 3-step process:

  1. Have Python open a file in "writing" mode.

  2. Call a writing function as many times as we wish to put information into the file.

  3. Close the file nicely.

For Step 1, there is a built-in Python function called open(), whose help we can look at with open?:

Signature:
open(
    file,
    mode='r',
    buffering=-1,
    encoding=None,
    errors=None,
    newline=None,
    closefd=True,
    opener=None,
)
Docstring:
Open file and return a stream.  Raise OSError upon failure.

file is either a text or byte string giving the name (and the path
if the file isn't in the current working directory) of the file to
be opened or an integer file descriptor of the file to be
wrapped. (If a file descriptor is given, it is closed when the
returned I/O object is closed, unless closefd is set to False.)

... The available modes are:

========= ===============================================================
Character Meaning
--------- ---------------------------------------------------------------
'r'       open for reading (default)
'w'       open for writing, truncating the file first
'x'       create a new file and open it for writing
'a'       open for writing, appending to the end of the file if it exists
'b'       binary mode
't'       text mode (default)
'+'       open a disk file for updating (reading and writing)
'U'       universal newline mode (deprecated)
========= ===============================================================

The default mode is 'rt' (open for reading text). For binary random
access, the mode 'w+b' opens and truncates the file to 0 bytes, while
'r+b' opens the file without truncation. The 'x' mode implies 'w' and
raises an `FileExistsError` if the file already exists.

So, the main parameters we need to supply are the filename (first arg by position) and the mode kwarg (set to writing, mode='w'). Python is set to read/write in text mode by default, which suits our purposes here, so we don't need to set that additionally (e.g., with mode='wt'). We will also write to a new file, so we don't need to append.

When we run the open function, a "stream" is opened for letting I/O flow to or from a file; we assign this to a variable. With the stream to the file open in write-mode, we then can perform Step 2 using this streaming object's write() method in place of our above print calls (without the end kwarg). Finally, when we are done sending information, we use the close() method of the streaming object to perform Step 3. Thus, to write our output to a file called "test_write_01.txt", we could do the following:

fff = open('test_write_01.txt', mode='w')       # Step 1: open file to write

# Step 2: write contents
for i in range(nrow):
    for j in range(ncol):
        fff.write("{:6.1f}".format(data[i, j]))
    fff.write("\n")

fff.close()                                     # Step 3: close file

You can verify that the text file output contains 5 rows and 2 columns, as above.
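As an aside, Python's with statement is a commonly used alternative to calling close() explicitly: the file is closed automatically when the indented block ends. The sketch below repeats the same three steps in that style and then reads the file back as a single string to check it. Writing into a temporary directory is an assumption made here only to avoid overwriting files in the working directory; the read mode used for the check is covered later in this section.

```python
import os
import tempfile
import numpy as np

data = np.array([[43.2, 4], [42.7, 21.5], [7.2, 30], [30, 16], [102, 38.1]])
nrow, ncol = np.shape(data)

# Assumption for this sketch: put the file in a temporary directory
fname = os.path.join(tempfile.mkdtemp(), 'test_write_01.txt')

# Steps 1-3 in one construct: the file is closed when the block ends
with open(fname, mode='w') as fff:
    for i in range(nrow):
        for j in range(ncol):
            fff.write("{:6.1f}".format(data[i, j]))
        fff.write("\n")

# Read everything back as one string to inspect what was written
with open(fname, mode='r') as fff:
    text = fff.read()

print(text)
```

The printed text should show the same 5 rows and 2 columns as the print-based version.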

26.1.3. Defining a function to write a file

It is possible that we would like to be able to write arrays of floats many times. For convenience, we can think of wrapping up our above code into a function, since it constitutes a well-defined and fairly self-contained job to do.

What "inputs" do we have? For starters, the array that we want to write, and probably the name of the output file. And what "outputs" are there? Well, it doesn't seem like there is anything to return, since the main output is written to a file.

Therefore, an initial go at what we want could look like:

def write_arr_float(data, fname):
    """Take a 2D array of floats data and write it to a
    whitespace-separated text file fname."""

    nrow, ncol = np.shape(data)

    fff = open(fname, mode='w')

    for i in range(nrow):
        for j in range(ncol):
            fff.write("{:6.1f}".format(data[i, j]))
        fff.write("\n")

    fff.close()

    print("Wrote {}x{} float arr to file: {}".format(nrow, ncol, fname))

... where we kept the same variables we used above, just introducing a string fname to be the name of the output file.

By writing a function, we are generalizing our lines of code to be used for many different arrays: instead of writing just one specific array called data, the user can try to write any array they have with this function. But we have to be careful: this function won't work with just any array, as we note in the docstring: the array must be 2D, and the elements must be floats. For example, if you pass in a 3D array, np.shape returns three values, and we will get an error when trying to unpack them into nrow and ncol. We should probably do more than just politely ask the user to provide specific inputs: we should put some checks (using conditions!) at the start of the function.

Q: Copy the above function. Try writing some checks for the given inputs, such as those mentioned above.
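One possible set of checks is sketched below; it is only one of many reasonable answers. The function name write_arr_float_checked, the choice to raise errors rather than print a message and return, and the output file name in the usage lines are all assumptions made for this illustration.

```python
import numpy as np

def write_arr_float_checked(data, fname):
    """Like write_arr_float, but check the inputs before writing."""

    arr = np.asarray(data)          # also lets a list of lists be passed in

    # Checks on the inputs, using conditions
    if arr.ndim != 2:
        raise ValueError("Need a 2D array, but got {}D".format(arr.ndim))
    if not np.issubdtype(arr.dtype, np.floating):
        raise TypeError("Need float elements, but got {}".format(arr.dtype))
    if not isinstance(fname, str) or not fname:
        raise ValueError("Need a non-empty string for the file name")

    nrow, ncol = arr.shape

    fff = open(fname, mode='w')
    for i in range(nrow):
        for j in range(ncol):
            fff.write("{:6.1f}".format(arr[i, j]))
        fff.write("\n")
    fff.close()

    print("Wrote {}x{} float arr to file: {}".format(nrow, ncol, fname))

# Usage: a 3D array is rejected before any file is opened
try:
    write_arr_float_checked(np.zeros((2, 3, 4)), 'test_write_03.txt')
except ValueError as err:
    msg = str(err)
    print("Check failed:", msg)
```

Because the checks come first, a bad input never leaves a partly written file behind.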


26.1.4. Basic file writing: CSV

In a comma-separated values (CSV) file, values within a line are, well, separated by commas; that is, the delimiter is ','. There is no comma at the end of a line, and there may or may not also be spaces around the values within a line.

We could adopt the same approach as above and just add a ',' within the string being printed/written, or perhaps add a separate print/write with the comma. Those might require some if-conditions to not place an unwanted comma at the end of a line, which is certainly fine.

Here, we'll choose to look at a slightly different way of addressing the problem. We can basically think of each line as a list of string entries with a comma between them, so why not actually make a list of strings and join them with a comma (in the sense of the string method join())?

In this case, we iterate through each row, starting an empty list each time. We then iterate through all the columns, appending strings of our desired format. After we finish building the list for the row, we can join all the elements with our comma delimiter, and perhaps even include the newline character we want at the same time. Thus:

fff = open('test_write_02.txt', mode='w')       # Step 1: open file to write

# Step 2: write contents
for i in range(nrow):
    row = []
    for j in range(ncol):
        row.append("{:6.1f}".format(data[i, j]))
    fff.write(",".join(row) + "\n")

fff.close()                                     # Step 3: close file

... produces a file that looks like:

  43.2,   4.0
  42.7,  21.5
   7.2,  30.0
  30.0,  16.0
 102.0,  38.1

Note that we could apply this same methodology to the earlier whitespace-separated case, just replacing the delimiter with a space.
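The remark about swapping the delimiter can be tried out on a single row. This is a minimal sketch; the variable names row_vals, line_ws and line_csv are made up for the illustration.

```python
row_vals = [43.2, 4.0]              # one row of the example data

# Build the list of formatted strings for this row
row = []
for val in row_vals:
    row.append("{:6.1f}".format(val))

line_ws  = " ".join(row) + "\n"     # whitespace-delimited line
line_csv = ",".join(row) + "\n"     # comma-delimited line

print(repr(line_ws))
print(repr(line_csv))
```

Note that joining with a space adds one extra space between the fixed-width fields compared with the earlier loop, which relied only on the padding inside each formatted value. The result is still a valid whitespace-delimited line.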

26.2. Reading files

To write an array to a file, above, we looped over row elements to build up a line of text, and then built up successive lines by looping over each row. To read a file, we will essentially reverse that process. Again, there are maaany ways a file can be formatted; here we focus on ASCII text files of numbers. And again, knowing about the file formatting ahead of time is vital: what type of values are in the file (float, int, str, ...)? what is the delimiter being used (whitespace, comma, tabs, ...)?

26.2.1. Basic file reading: Whitespace separation

To read a text file in Python, we will do the following steps:

  1. Open a file in "read" mode.

  2. Read in the contents of the file, storing content line by line.

  3. Close the filestream.

For Step 1, we will use our old built-in Python friend open(), introduced above for writing a file. We use it with similar syntax, just specifying the mode to be for reading. The result of open() is assigned to a variable, whose methods we use in the other steps.

To perform Step 2, we will use a filestream method called readlines() that will go through the opened file from start to finish. It turns each line into a string, and stores these as successive elements in a list, which we can assign to a variable; if there are N lines in the file, then the length of the returned list will also be N.

Finally, as above, we finish by closing the filestream in Step 3, using the same close() method shown above.

An example of using the above steps to read the whitespace-separated file we created a little while ago (or if you skipped that part, download it here) is:

fname = 'test_write_01.txt'

fff   = open(fname, 'r')    # Step 1: open a file for reading
X     = fff.readlines()     # Step 2: read in, line-by-line; X is a list
fff.close()                 # Step 3: close the stream

Let us take a look at what X holds:

print(X)
print(type(X))
print(len(X))

... which outputs:

['  43.2   4.0\n', '  42.7  21.5\n', '   7.2  30.0\n', '  30.0  16.0\n', ' 102.0  38.1\n']
<class 'list'>
5

As promised, X is a one-dimensional list whose elements are strings showing the contents of each line of the file, in order. Note that all the spacing of each line is present, even the newline char that ends it. Thus, Python doesn't try to guess anything about the content being numbers or how they are separated per line: that is up to us to work out.

Well, we know that each line contains a row of input data. We can now add the following steps to our "reading in data" procedure:

  4. loop over each element of X (= get every line from the file); we know how many lines there are, from the list length,

  5. split the element string at whitespace, which is delimiting our columns (= get a list of individual values),

  6. and, since each separated value is still a string at this point, convert each number to float.

We can kind of picture this as having all the values stuck in a block of ice (the list X), and we want to systematically chip away the non-number parts and get our number values out. As part of this, we need to know how we want to store the values as we get them.

At the start, we only know how many lines there were in the original file (via len(X)), not how many columns, because each line was read in as a single string. When we split each row string with our chosen delimiter, we will get a list of all the elements that were sitting there in the first place: so perhaps it makes sense to start with an empty list, and accumulate these row lists as we get them? We can always convert our list to an array at the end.

Here is a start of the process over Steps 4 and 5 above, printing as we go:

N = len(X)
Y = []                          # init output list: will append each row

for i in range(N):              # Step 4: loop over all rows
    row = X[i].split()          # Step 5: split the [i]th row at whitespace
    print("[{}]th row: {}".format(i, row))
    Y.append(row)

print("Final list looks like:\n", Y)

... which produces:

[0]th row: ['43.2', '4.0']
[1]th row: ['42.7', '21.5']
[2]th row: ['7.2', '30.0']
[3]th row: ['30.0', '16.0']
[4]th row: ['102.0', '38.1']
Final list looks like:
 [['43.2', '4.0'], ['42.7', '21.5'], ['7.2', '30.0'], ['30.0', '16.0'], ['102.0', '38.1']]

We are doing pretty well so far: we now have a 2D list of the correct dimensions (5×2), and the elements mostly look like what we want. However, we cannot convert this to an array yet, because the "numbers" are still strings within the list. So, as we go through each row, we could convert each element of row to a float:

N = len(X)
Y = []                          # init output list: will append each row

for i in range(N):              # Step 4: loop over all rows
    row = X[i].split()          # Step 5: split the [i]th row at whitespace
    for j in range(len(row)):   # Step 6: overwrite each ele with float ver
        row[j] = float(row[j])
    print("[{}]th row: {}".format(i, row))
    Y.append(row)

print("Final list looks like:\n", Y)

At this point, we now have a 2D list of the right dimensions and each element is the type we wanted. The final, final step could be to convert this to a more mathematical object, an array:

Yarr = np.array(Y)
print(Yarr)

... which displays:

[[ 43.2   4. ]
 [ 42.7  21.5]
 [  7.2  30. ]
 [ 30.   16. ]
 [102.   38.1]]

Note some points about this process. We did not really have to know how many lines or how many columns of data we had: those pieces of information could be derived from our objects as we went. We did, however, have to know how the values were separated: if we had a CSV file, for example, we could use the above code with the small adjustment of splitting with the appropriate delimiter ','. We also had to know what type of numbers were stored, because of the need for explicit type conversion of each row element.

We also implicitly relied on the fact that each row had the same length (same number of elements: no missing data); in the imperfect world of real data, we might have to prepare for such eventualities, but we leave that for another discussion. Similarly, we could try to generalize the above process a bit, but that can quickly get more complicated and assumption specific, so we will also leave that for a later date.
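To make the CSV adjustment mentioned above concrete, here is a minimal sketch for a single line, written the way the earlier CSV example formats its rows. The variable names are chosen just for this illustration.

```python
line = "  43.2,   4.0\n"            # one line, as readlines() would return it

parts = line.split(',')             # split at the comma delimiter
print(parts)                        # padding and the newline are still attached

# float() ignores leading/trailing whitespace, including the newline
vals = []
for p in parts:
    vals.append(float(p))
print(vals)
```

Unlike split() with no argument, split(',') leaves the surrounding spaces and the trailing newline on the pieces, but the conversion with float() tolerates them.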

26.2.2. Defining a function to read a file

Analogously to turning our file-writing code into a function, we can turn our above work into a file-reading function. We can take a look at the final blocks of code for performing all of the Steps 1-6 above, and decide what our inputs and outputs are.

Well, the main input seems to be just the filename: everything else was derived from the data itself. And this time, we will have an output: the final array of values. So:

def read_file_float_arr(fname):
    """Read a file fname and return an array of its values.

    The file should contain floating point values arranged in a 1-
    or 2-D array."""

    fff   = open(fname, 'r')
    X     = fff.readlines()
    fff.close()

    N = len(X)
    Y = []

    for i in range(N):
        row = X[i].split()
        for j in range(len(row)):
            row[j] = float(row[j])
        Y.append(row)

    Yarr = np.array(Y)
    return Yarr

As noted in the docstring, this code should actually work to read in any 1D- or 2D-like files, since the former could be considered just a 2D array with shape 1×N or N×1.

As a final note, the above code could be condensed a little bit using some of Python's conveniences. For example, the first loop could be changed to make X itself the iterable (and we no longer need to calculate N); the second loop, which converts the elements of row, could be changed to a list comprehension.

Q: Take a minute and try defining a new function read_file_float_arr_condensed() with these changes.


Q: Can you copy the above function and apply list comprehension?
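One possible condensed version is sketched below, combining both changes: looping directly over X and building each row with a list comprehension. This is just one way to answer the questions above. The small demo file, its name, and the use of a temporary directory are assumptions added so that the example runs on its own.

```python
import os
import tempfile
import numpy as np

def read_file_float_arr_condensed(fname):
    """Read a whitespace-separated text file fname of floats and
    return an array of its values."""

    fff = open(fname, 'r')
    X   = fff.readlines()
    fff.close()

    Y = []
    for line in X:                  # X itself is the iterable: no N needed
        Y.append([float(val) for val in line.split()])

    return np.array(Y)

# Usage on a small demo file (hypothetical name, in a temporary directory)
fname = os.path.join(tempfile.mkdtemp(), 'demo_read.txt')
fff = open(fname, mode='w')
fff.write("  43.2   4.0\n  42.7  21.5\n")
fff.close()

Yarr = read_file_float_arr_condensed(fname)
print(Yarr)
```

The behavior is intended to match read_file_float_arr; only the looping style has changed.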


26.3. Practice

  1. Consider a dataset with similar values to data, above, but with the following change:

    data2 = copy.deepcopy(data)
    data2[3,1] = 1111.1
    

    Using the basic example above for data2 would lead to the following output:

     43.2   4.0
     42.7  21.5
      7.2  30.0
     30.01111.1
    102.0  38.1
    

    ... which has a formatting problem (two values are merged undesirably). Adjust the print/write functions used above to fix this bad output and retain 2 columns. Can you adjust the print/write functions so that this merging of two columns never happens, regardless of how large the float values become (even if it means the columns lose their visually pleasing vertical alignment)?

  2. Adapt the functionality in the above file-writing examples to define a function write_arr_float_gen that uses a kwarg to write a file with either CSV or whitespace separation (or, indeed, allows the user to define any delimiter they want).

  3. There is a function in NumPy called np.savetxt() that provides some functionality to write an array as a simple text file, similar to what we did in the first section, above. Investigate its functionality, and use its args and kwargs to write a 2D array with the same style of formatting we used, above, for each of the whitespace and CSV cases.