19. Strings, II, and formatting

We introduced strings very briefly earlier on. We now take a deeper look at the type/class, its methods, and how to combine them more usefully with data for display/visualization.

19.1. String type

The string type is another ordered collection in Python (we have already seen arrays). The length N = len(STRING) of a given string will be important to consider, stating how many elements are in it. Each element of a string is of a single type, commonly called a character (or char), but it is also really just a string itself with N=1 elements.

From the above, we should be unsurprised to see the following displays of string length, type and values:

print("len of S  :", len(S))
print("type of S :", type(S).__name__)
print("val of elements 0 and 3  :", S[0], S[3])
print("type of elements 0 and 3 :", type(S[0]).__name__, type(S[3]).__name__)

... produce:

len of S  : 13
type of S : str
val of elements 0 and 3  : W k
type of elements 0 and 3 : str str

We have already explored indexing and slicing rules in strings. (And these syntaxes also apply to arrays and other ordered collections.)

Capitalization matters in strings: uppercase and lowercase characters are not equal. Evaluating:

'A' == 'a'

... produces False. And Python knows the difference between a string '5' and an int 5, so that:

'5' == 5

... produces False, as well.

There is a special kind of string called the null string (or empty string), which has the property of having no characters, just an "open" quote followed immediately by the "close" one. As such, its length:

i_am_a_null_str = ''
print(len(i_am_a_null_str))

... is 0. This is an interesting case because bool('') evaluates to False, as if it were the string equivalent of 0 in the int type (hence the name "null string"). Any other string within the bool type conversion, such as bool('Hi'), will evaluate to True.
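
As a small sketch of this rule (the variable names below are only illustrative), we can pass a few strings to bool() and see that only the empty one converts to False:

```python
empty = ''      # the null (empty) string
blank = ' '     # a single space: not empty, since it has length 1
zero = '0'      # the character '0', not the int 0

# Only the empty string converts to False.
print(bool(empty), bool(blank), bool(zero))
```

Note that a string holding only whitespace, or only the character '0', still counts as non-empty.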

Q: How many lines are printed in the following?

s1 = '''Abc'''
if len(s1) < 5 :
    print("Am I printed?")

Q: How many lines are printed in the following?

s2 = ''
if s2 :
    print("Am I printed?")

However, while we can use indices to select string elements, we cannot simply reassign string elements, like we could with arrays. Thus, trying the following:

S[0] = 'T'

... leads to an error:

<ipython-input-349-d9a85c67a48e> in <module>
----> 1 S[0] = 'T'

TypeError: 'str' object does not support item assignment

In Python, strings are an immutable ordered collection, which literally means their elements cannot be reassigned while the rest of the object is unchanged. This is a different situation than with arrays, which are a mutable kind of collection. However, we will see below that we can effectively change string elements using some of the class's methods.
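
In the meantime, one simple workaround is to build a new string from pieces of the old one, using slicing and concatenation. The following is only a sketch, and the string word is a hypothetical example (it is not the S used above):

```python
word = "Cairo"               # hypothetical example string
new_word = 'T' + word[1:]    # 'T' followed by everything after index 0

print(word)        # the original string is unchanged
print(new_word)    # a separate, new string object
```

The original object is never altered here; the name new_word simply refers to a newly created string.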

19.2. Whitespace and special characters

As seen above when looking at the string S, the space character is just a normal string element. But it is also a member of a particular group of characters called whitespace, which also includes tabulation (tabs) and newlines. As the name implies, these characters are generally used to provide spaces or breaks between other characters. These breaks are useful for people visualizing text, data output, etc.; they provide spaces between words, vertical alignment and more. Here is a list of whitespace characters in Python:

Whitespace character   Name      Description
' '                    space     move cursor one space to the right
\t                     tab       move cursor to the next predefined "tabulation stop"; these stops are evenly spaced across a line
\n                     newline   move cursor to start of next line

You might notice that multiple whitespace characters are written as \ with a following letter. This is a syntax in programming, where the \ is called an escape character: it leads Python to alter the interpretation of one or more characters following it. Thus, abc\td has a different interpretation than abctd; we say that \t is an escape sequence, which is typically just the escape and the character following it (though some escape sequences are longer). Having escape sequences basically expands the kinds of things that can be put into a string beyond just letters and numbers (most programming languages have such a syntax, and \ is often the escape character). Here are various whitespace examples in action:

print("Whitespace example with only spaces inserted")
print("Whitespace example with 2\t\ttabs inserted")
print("Whitespace example with a\nnewline char inserted")

... which evaluate to:

Whitespace example with only spaces inserted
Whitespace example with 2            tabs inserted
Whitespace example with a
newline char inserted

Note that \t and \n are each treated as a single character, and also that these don't need to be separated by spaces or anything to be recognized. You can see this by checking the length of the following strings:

print(len("abc d"))
print(len("abc\td"))
print(len("abc\nd"))

which is 5 in each case. Also note that tabulation does not insert a set amount of whitespace. Each text editor has evenly spaced locations across a page, and the tabulation makes the cursor "jump" to the next one that appears to the right. This tab spacing varies across both text editor programs and computers. Try the following example on your system:

print("abc\tdefghijklm")
print("abcde\tfghijklm")
print("abcdefg\thijklm")
print("abcdefghi\tjklm")
print("abcdefghijk\tlm")

On our computer, this looks like:

abc     defghijklm
abcde   fghijklm
abcdefg hijklm
abcdefghi       jklm
abcdefghijk     lm

Note

Using tabulation can be a bit unstable if you run code on different systems, because the size of tabs can vary. Tabs can be a quick way to insert vertically-aligned space, but we will see some more broadly stable approaches, below, with string formatting.
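
To see how the tab-stop spacing changes what is displayed, one can convert tabs to spaces explicitly with the built-in string method expandtabs(). This is a minimal sketch; the tab sizes of 4 and 8 are just example choices, and your terminal's own setting may differ:

```python
text = "abc\tdefghijklm"

narrow = text.expandtabs(4)   # tab stops every 4 columns
wide = text.expandtabs(8)     # tab stops every 8 columns (the method's default)

print(narrow)
print(wide)

# The tab counts as one character, but it expands to a varying number of spaces.
print(len(text), len(narrow), len(wide))
```

The same single tab character becomes one space in the first case and five in the second, because each time the text jumps to the next stop.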

Above, the escape character \ is used to make a "normal" character signify something else, such as \t to mean "tab" instead of the letter "t". However, the same syntax is also used to escape the behavior of a "special" character in order to make it "normal". One example of this being useful is to include a " character inside a string whose open/close quotes are also ", without needing to change the open/close quotes themselves:

print("The \" is my favorite character.")

... which successfully prints:

The " is my favorite character.

Q: What is another way to successfully print " within the above string?


However, we might also accidentally escape a character's behavior that we did not want to modify. Consider the following, trying to display the \ itself:

print("The backslash looks like: \")

This produces a syntax error:

  File "<ipython-input-350-536484bd8797>", line 1
    print("The backslash looks like: \")
                                        ^
SyntaxError: EOL while scanning string literal

... where the Python interpreter points to a spot after the actual problem (the acronym EOL is for "end of line"). This happens because by default Python interprets the backslash as escaping whatever follows it, which in this case is the second quotation mark ": Python no longer recognizes that quote as closing the string, and keeps searching further in that line. When the EOL is reached before a close of the string, Python complains. To resolve this, one can actually use the escape character itself in order to escape the escape character's escaping behavior (read that twice!):

print("The backslash looks like: \\")

... which produces the appropriate output:

The backslash looks like: \

Escape successful! Note that writing \\ leads to only one backslash being printed, because the first one is on escaping duty.
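
As a quick check of this (a small sketch, with variable names chosen just for illustration), we can confirm that an escape sequence written with two characters in the code is stored as a single character in the string:

```python
bs = "\\"    # one backslash character, written as two in the source code
qt = "\""    # one double-quote character

print(len(bs), len(qt))   # each string holds exactly one character
print(bs + qt)            # displays a backslash followed by a double quote
```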

This \\ is one example of a non-whitespace escape sequence. Some other potentially useful ones are:

Escape sequence   Name                       Description
\\                backslash                  display single backslash char (escape the escaper)
\', \"            single (or double) quote   display single (or double) quote, and do not interpret it as a string start or finish
\b                backspace                  move the cursor back one char, so that whatever is printed next overwrites the preceding char (putting several in a row moves back that number of chars); the exact display depends on the terminal

Note that curly brackets are not escaped with a backslash: \{ and \} are not valid escape sequences in Python. Inside a string used with format(), a literal left (or right) curly bracket is instead written by doubling it, as {{ (or }}); we will see below why { } is special in "string formatting".

Q: How would you print the following string?
The tab character looks like: \t.

19.3. String operators

The behavior of + and * with strings was discussed earlier.

Briefly, + concatenates two strings, and can be part of a series expression:

print("Hello" + "And" + "Goodbye ")

... outputs: HelloAndGoodbye.

And * can operate on a string together with an int, producing a new string that is that integer number of copies of the original string concatenated:

print("Xyz" * 3)

... outputs: XyzXyzXyz.

Q: These operators can also be combined in expressions. How might one compactly make the following string?
Bafana Bafana and Banyana Banyana each drew their matches.

19.4. String methods (examples)

As we have seen in Python, each type generally comes with a host of useful methods and attributes. Here we just touch on the variety of string-specific functionalities that are built into Python. Each of the string methods will output new objects, rather than operating in-place (because the string type is immutable), but it is still a good habit to double-check the help when first becoming acquainted with a method.

Note that Python itself typically doesn't distinguish between individual characters, whether letter, number, whitespace or another symbol. We can see this in some of the string methods. However, because humans often do, there are several string methods that can help to identify words and other "lexical" properties.

You can find out more comprehensively about the available string methods with help(str), of which we just highlight some useful ones here:

capitalize

|  capitalize(self, /)
|      Return a capitalized version of the string.
|
|      More specifically, make the first character have upper case and the rest lower
|      case.
|

This method takes no arguments (recall: the self means that the object it is attached to is essentially an input, but we don't name it again when using the method). It just capitalizes the first character in a string (and makes the rest lowercase):

print('antananarivo'.capitalize())
print("MADAGASCAR".capitalize())
print("tsingy de BEMARAHA".capitalize())
print(" zebu".capitalize())

... produces:

Antananarivo
Madagascar
Tsingy de bemaraha
 zebu

You can see that these results match the description: Python doesn't know about whitespace-separated words within the string to capitalize. And if the first character is whitespace, it doesn't try to find the first non-whitespace character.
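
If capitalizing each whitespace-separated word is what you actually want, the related built-in method title() is one option. A brief comparison, as a sketch only (check help(str.title) for its exact rules, which can surprise with apostrophes and similar cases):

```python
name = "tsingy de BEMARAHA"

cap = name.capitalize()   # first character of the whole string only
ttl = name.title()        # first character of each word

print(cap)
print(ttl)
```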

count

|  count(...)
|      S.count(sub[, start[, end]]) -> int
|
|      Return the number of non-overlapping occurrences of substring sub in
|      string S[start:end].  Optional arguments start and end are
|      interpreted as in slice notation.
|

This method takes one required argument, sub, which is the "sub"string to search for within the main string. It can then take up to two optional args, to control the starting and ending points of the search:

place = '''Ouagadougou, Burkina Faso'''
place.count('ou')

... outputs: 2 (noting again that Python distinguishes between uppercase and lowercase). It is easy to forget the string-quotes on ou, which would lead to the following error:

----> 2 place.count(ou)

NameError: name 'ou' is not defined

... as Python tried to interpret ou as a variable name (which would be fine if you had defined it as one previously).

Using the options, one has:

print(place.count('ou', 7))     # search from index [7] to the end
print(place.count('ou', 7, 8))  # search within the half-open index range [7, 8)

... which produces 1 and 0 respectively (the second case defines a preeetttty narrow subset of the initial string to search within).
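
Two details from the docstring and the discussion above can be checked directly: the count is of non-overlapping occurrences, and it is case-sensitive. The sketch below uses the built-in lower() method (which returns a lowercase copy) to make a case-insensitive count; the string text is just an illustrative example:

```python
text = "aaaa"
n_pairs = text.count("aa")      # non-overlapping: 'aa' + 'aa', not 3 overlapping pairs

place = "Ouagadougou, Burkina Faso"
n_exact = place.count("ou")             # misses the leading 'Ou'
n_anycase = place.lower().count("ou")   # count after lowercasing a copy

print(n_pairs, n_exact, n_anycase)
```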

find

|  find(...)
|      S.find(sub[, start[, end]]) -> int
|
|      Return the lowest index in S where substring sub is found,
|      such that sub is contained within S[start:end].  Optional
|      arguments start and end are interpreted as in slice notation.
|
|      Return -1 on failure.
|

This method works quite similarly to the count() method, above, but instead of returning how many times sub appears, it returns the "lowest index in S where substring sub is found": that is, the first place in the string that sub is found.

We note that one has to be a little careful about how we interpret the output, because Python recognizes -1 as a valid index. Consider using our variable from above; we could use find() and print the character at the returned index as follows, to verify that the method works:

ii = place.find('B')
print(place[ii])

And this outputs B, so that is fine. But what happens if we try searching for a different letter, which isn't there? Well, the output of the following:

ii = place.find('Z')
print(place[ii])

... is o, so it appears that Python is wrong (!). However, if we print the index, as well, we can see what is happening:

ii = place.find('Z')
print(ii)
print(place[ii])

... because now we recognize that the index is -1, which the docstring tells us to interpret as "not found", and we see that the o appears because place[-1] picks out the last character.
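
A defensive pattern, sketched here as one possible approach, is to test the returned index against -1 before using it to index the string:

```python
place = "Ouagadougou, Burkina Faso"

for letter in ("B", "Z"):
    ii = place.find(letter)
    if ii == -1:
        # -1 means "not found": do not use it as an index
        print(letter, "was not found")
    else:
        print(letter, "was found at index", ii, "->", place[ii])
```

With this check in place, the "not found" case is reported as such, instead of silently picking out the last character.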

split

|  split(self, /, sep=None, maxsplit=-1)
|      Return a list of the words in the string, using sep as the delimiter string.
|
|      sep
|        The delimiter according which to split the string.
|        None (the default value) means split according to any whitespace,
|        and discard empty strings from the result.
|      maxsplit
|        Maximum number of splits to do.
|        -1 (the default value) means no limit.
|

This is a really useful method for helping us to identify words in a string that contains a line of text. This method will break up a string into a list (another ordered collection we will discuss shortly) of strings, with the split happening at a particular "separator" sep; default sep(arator) is any whitespace. So, consider the following:

sentence = '''Yamoussoukro:\t the political capital of Cote d'Ivoire.'''
print(sentence)
print(sentence.split())

... whose output is:

Yamoussoukro:  the political capital of Cote d'Ivoire.
['Yamoussoukro:', 'the', 'political', 'capital', 'of', 'Cote', "d'Ivoire."]

Note that in each new string in the list, there is no whitespace. We could choose to split the string at, say, only spaces or, say, at 'it':

print(sentence.split(' '))         # specify sep as arg by position
print(sentence.split(sep='it'))    # specify sep as kwarg

... which outputs:

['Yamoussoukro:\t', 'the', 'political', 'capital', 'of', 'Cote', "d'Ivoire."]
['Yamoussoukro:\t the pol', 'ical cap', "al of Cote d'Ivoire."]
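
The difference between the default separator and an explicit ' ' also shows up when spaces are repeated, and the maxsplit argument from the docstring limits how many splits are made. A small sketch, with a made-up example string:

```python
line = "one  two three"             # note the two spaces after 'one'

by_default = line.split()           # any run of whitespace; empty strings discarded
by_space = line.split(' ')          # every single space is a separator
one_split = line.split(maxsplit=1)  # stop after the first split

print(by_default)
print(by_space)
print(one_split)
```

Splitting on an explicit ' ' keeps an empty string between the two adjacent spaces, whereas the default behavior discards it.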

Note

There are many more methods for strings, and it is worth exploring them. The above gives a sampling, and some demonstrations of translating the docstring sections for each. The Practice Problems below include more, as well.

19.5. String formatting: Place data into strings

Often we want to display results of our work at several points in a program: in the middle of calculations to check the code's progress and/or to see that intermediate results are running; at the end to display the final results. Additionally, we might want to make labels for plotting, or save data to an output file for further use. To do this, we have to fully understand what we are outputting (the type, length/shape of object, etc.). Then decisions have to be made like how many decimal places to display, how to include text describing what each number is in many cases, how to make aligned columns of results, etc. This all comes under the category of string formatting: inserting the quantities of interest into strings and specifying what display properties they should have.

19.5.1. Basic format method

Up until this point, we have used print() to display strings and other types very simply, by separating each item with commas:

print("Finished.")
print("x =", x)
print("Avec =", A, "and Bvec =", B)

etc. We now look at more interesting ways to insert data into strings and format the results.

There are different methods and styles for performing this kind of operation in Python, but we will primarily use the modern, built-in string format() method. From scrolling down/searching the help(str) docstring, we see that this method takes both positional and keyword arguments:

...
format(...)
  S.format(*args, **kwargs) -> str

  Return a formatted version of S, using substitutions from args and kwargs.
  The substitutions are identified by braces ('{' and '}').
...

The basic approach for this is to write a string with {} positioned as a placeholder each time we will want to insert a value somewhere, and then we provide the values themselves as arguments. The values can be either variables or expressions to be evaluated. For example:

import numpy as np   # needed for the array examples at the end

x = 5
print("x = {}".format(x))

y = -15.5
print("If y = {}, then five minus it = {}".format(y, 5-y))

z = "Xhosa"
print("The first two letters of '{}' are {}.".format(z, z[:2]))

Avec = np.arange(-2, 2)
Bvec = np.ones(2, dtype=bool)
print("\nAvec = {} and Bvec = {}".format(Avec, Bvec))

produces:

x = 5
If y = -15.5, then five minus it = 20.5
The first two letters of 'Xhosa' are Xh.

Avec = [-2 -1  0  1] and Bvec = [ True  True]

In this usage, if we have N values to insert, we will reserve N spaces in the string with curly brackets {} and also have N comma-separated values within the format(...) method. And note from the example of printing arrays: the above works well for short arrays, but for longer ones, we might want to make a loop that displays each index with its value:

for idx in range( len(Avec) ):
    print("[{}]th value is : {}".format(idx, Avec[idx]))

... produces:

[0]th value is : -2
[1]th value is : -1
[2]th value is : 0
[3]th value is : 1

This provides a nicer way to view large arrays, lists or other ordered collections. In the next section we will see more options to format this even more precisely, such as controlling vertical alignment, spacing and more.
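
The docstring above also mentions keyword arguments, and the placeholders can be numbered to pick out positional arguments explicitly. The following is a brief sketch of both styles; the names first and second are arbitrary labels chosen for this example:

```python
a = 3
b = 4

# Numbered placeholders: each value can be reused or reordered.
s_pos = "{0} + {1} = {2}, and {1} + {0} = {2}".format(a, b, a + b)

# Named placeholders: values are matched by keyword.
s_kw = "{first} then {second}".format(first=a, second=b)

print(s_pos)
print(s_kw)
```

Numbered or named placeholders can make longer strings easier to read, since the order of the arguments no longer has to mirror the order of the braces.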

Note

Here and in discussion below, we generally print the strings that are being formatted. This is just because we want to quickly display the results. However, in each case, we could alternatively save the results to a variable with assignment, such as:

z   = "Xhosa"
var = "The first two letters of '{}' are {}.".format(z, z[:2])

... and use it further in some way. We mention this so there is no misapprehension that string formatting only occurs inside print(); instead, it is quite general and useful.

19.5.2. Format specifiers: string type

Beyond just displaying a variable, we can control several aspects of spacing, alignment and decimal output (where appropriate) using format specifiers. These are placed within each pair of curly brackets whose contents you want to format. The specifier starts with a colon :, can then contain some options, and typically ends with the type. We first look at options for the string type, which is denoted by ending the format specifier with an s.

One of the most common formatting options for strings is to create a window of a certain width (in terms of number of characters), into which the string value is inserted, and the next characters appear after that window. If the string length is greater than the window's width, then it is effectively as if the window weren't specified. Additionally, one can specify whether to use left (<), right (>) or center (^) justification for the string within that window. Consider the following:

ss = 'abc'
print("0123456789|")            # just making a reference "count" of spaces
print("{:10s}|".format(ss))     # window = 10 spaces, str = def loc (left)
print("{:<10s}|".format(ss))    # window = 10 spaces, str = left loc
print("{:^10s}|".format(ss))    # window = 10 spaces, str = center loc
print("{:>10s}|".format(ss))    # window = 10 spaces, str = right loc

... which produces:

0123456789|
abc       |
abc       |
   abc    |
       abc|

The vertical | shows where the edge of the specified 10 char window is in each case. Note how the center justification is "as central as possible", depending on the relative even/oddness of the window and inserted string.

When specifying justification, you can also choose to fill the spaces in the window with a particular character, such as in the following:

print("{:.<10s}|".format(ss))
print("{:-^10s}|".format(ss))
print("{:z>10s}|".format(ss))

... which produces:

abc.......|
---abc----|
zzzzzzzabc|
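
The case mentioned earlier, where the string is longer than the window, can be checked with a short sketch (long_word is just an example value):

```python
long_word = "abcdefghijklmno"     # 15 characters, wider than the window

left = "{:10s}|".format(long_word)
right = "{:>10s}|".format(long_word)

print("0123456789|")
print(left)
print(right)
```

Both lines display the full string: the window sets a minimum width, so nothing is cut off and the justification has no visible effect.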

19.5.3. Format specifiers: numerical type

Python has a lot of numerical types, so perhaps we shouldn't be surprised that there are also several numerical specifier types (and here we only consider scalar/non-collection ones). Some of the more common ones include:

Type    Description
b       binary
d       integer
e, E    scientific notation float, with "e" or "E", respectively
f       "fixed point" float (i.e., standard decimal notation)
g, G    "general" format: if the value is within a couple magnitudes of 0, display as "fixed point"; else, using scientific notation
%       convert to percent (multiplies by 100) and display with % at end

Note

Mixing specifier and variable types between string and numerical types (e.g., trying to put an int into a :s specifier, or a str into a :f specifier) produces an error.

Within numerical types, trying to put a float into :d fails, but putting an int into :f is allowed.

So:

dd = 45
print("{:d}".format(dd))
print("{:e}".format(dd))
print("{:E}".format(dd))
print("{:f}".format(dd))
print("{:%}".format(dd))

... produces:

45
4.500000e+01
4.500000E+01
45.000000
4500.000000%
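
The type-mixing rules in the note above can be explored safely with try/except, which catches the error instead of stopping the program. This is only a sketch, and the exact wording of the error message may vary between Python versions:

```python
as_binary = "{:b}".format(5)       # int into the binary specifier
int_as_float = "{:f}".format(3)    # int into :f is allowed
print(as_binary, int_as_float)

failed = False
try:
    "{:d}".format(3.0)             # float into :d is not allowed
except ValueError as err:
    failed = True
    print("ValueError:", err)
```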

Python appears to default to a precision of 6 decimal places across all relevant specifier types. This can be controlled by using .NUMBER in the specifier as follows:

print("PI is approx: {}".format(np.pi))         # def with no specifier
print("PI is approx: {:f}".format(np.pi))       # def for fixed point spec
print("PI is approx: {:.2f}".format(np.pi))     # 2 decimals
print("PI is approx: {:.25f}".format(np.pi))    # 25 decimals

Note that the displayed output is rounded, not just truncated:

PI is approx: 3.141592653589793
PI is approx: 3.141593
PI is approx: 3.14
PI is approx: 3.1415926535897931159979635

The window and justification syntax from the above string formatting applies equivalently to numerical types, and it can be combined with the precision specification, too:

ff = 0.123
print("{:10f}|".format(ff))
print("{:<10f}|".format(ff))
print("{:^10.3f}|".format(ff))
print("{:>10.0f}|".format(ff))

... produces:

  0.123000|
0.123000  |
  0.123   |
         0|

Note, though, that numerical values are displayed with right justification by default (instead of left, as for strings).

19.5.4. Format specifiers: practical example

The above is fun, sure, but why would we need to spend all this time thinking about windows for variables and numbers of decimal places?

Well, when we want to display data, vertical alignment and horizontal displacement can make a big difference in being able to quickly (and accurately) assess output. Consider the following example, where we display an array:

C = np.array([-18.5, 300.1234, 0.1, 99.9999999, 25, 16, 91, 0.032, -14.4, 0, 1, 56])
N = len(C)
for i in range(N):
    print("val [{}] --> {}".format(i, C[i]))

This produces:

val [0] --> -18.5
val [1] --> 300.1234
val [2] --> 0.1
val [3] --> 99.9999999
val [4] --> 25.0
val [5] --> 16.0
val [6] --> 91.0
val [7] --> 0.032
val [8] --> -14.4
val [9] --> 0.0
val [10] --> 1.0
val [11] --> 56.0

Because the array elements have no vertical alignment, it's hard to tell where the largest or smallest value might be: just because a number has a lot of decimal places doesn't mean it's large. Additionally, since there are more than 10 elements, the indices change from being single to double digit, which also makes it difficult to check the numbers. So, let's try to fix this with format specifiers. There is a lot of freedom here, but consider the following:

for i in range(N):
    print("val [{:2}] --> {:15.8f}".format(i, C[i]))

... which produces:

val [ 0] -->    -18.50000000
val [ 1] -->    300.12340000
val [ 2] -->      0.10000000
val [ 3] -->     99.99999990
val [ 4] -->     25.00000000
val [ 5] -->     16.00000000
val [ 6] -->     91.00000000
val [ 7] -->      0.03200000
val [ 8] -->    -14.40000000
val [ 9] -->      0.00000000
val [10] -->      1.00000000
val [11] -->     56.00000000

This looks much easier to assess! Again, details will matter a lot: we kind of have to know the largest index value to know how many spaces it might take up; we have to know the type and relevant number of decimal places in the array; etc. However, once we know those things, we can really make nice visual displays with just a couple parameters.

19.5.5. Format specifiers: nesting (bonus)

As a small, extra (but fun!) point about format specifiers: the specifier options themselves can be inserted, via nested specifiers. That is, consider the following:

print("The values are:  |{:{}.{}f}|  |{:{}.{}f}|".format(1.234567, 15, 9, 9.87, 10, 5))

... which produces:

The values are:  |    1.234567000|  |   9.87000|

What is happening here? Well, by counting the curly braces, there are 6 values to insert. How do we know the order in which values will be placed? Well, start from left to right; when we get a curly bracket, that will get the first value. We then check for any inner curly brackets, and if there are some, go left-right filling in with the next values. When done, continue through the string for any more "outer" curly brackets, and repeat. Thus, the above is equivalent to:

print("The values are:  |{:15.9f}|  |{:10.5f}|".format(1.234567, 9.87))

(You can verify this more precisely by setting the string currently in each print() to a variable, and use == to check strict equality.)
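
One practical use of nesting is to compute a window width from the data, rather than hard-coding it. The sketch below revisits the aligned display from the previous section, assuming a short plain Python list (vals) as a stand-in for the array; the names vals and wid are our own choices for this example:

```python
vals = [-18.5, 300.1234, 0.1, 99.9999999, 25, 16, 91, 0.032, -14.4, 0, 1, 56]
N = len(vals)

# Number of digits in the largest index, e.g. 2 when the last index is 11.
wid = len(str(N - 1))

for i in range(N):
    # Values fill the braces left to right: i, then wid (nested), then vals[i].
    print("val [{:{}d}] --> {:15.8f}".format(i, wid, vals[i]))

first_line = "val [{:{}d}] --> {:15.8f}".format(0, wid, vals[0])
```

If the collection grows past 100 elements, wid becomes 3 and the index column stays aligned without editing the format string.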

19.6. Practice

  1. Print the following string:
    I don't really care for ", ', ''' or """.  But \ is OK.
  2. Use the str method replace to:

    1. put your own name in the place of who? in this string: My name is who?.

    2. swap old and new in the following string: What's old is now new!.

  3. Use the str method strip to:

    1. trim whitespace from around (but not within) the sentence in this str: \tHere is the sentence, by itself.  \n

    2. remove the dashes from ------this string---------

    3. leave only the word center in: nnnnncenternnnnn

    4. What is the name of the related method to strip whitespace (or other characters) from only the right side of the string?

  4. Write a for-loop to see how many orders of magnitude from zero a number has to be before the g format specifier changes from displaying a fixed point float to using scientific notation. Investigate using powers of 10, with both negative and positive exponents.

  5. Can you draw the following picture using only one print statement?

    [Image not shown: a triangle]