19. Strings, II, and formatting
We introduced strings very briefly earlier on. We now take a deeper look at the type/class, its methods, and how to combine them more usefully with data for display/visualization.
19.1. String type
The string type is another ordered collection in Python (we have already seen arrays). The length N = len(STRING) of a given string is important to consider: it states how many elements are in it. Each element of a string is of a single type, commonly called a character (or char), but it is really just a string itself, of length 1.
From the above, we should be unsurprised to see the following displays of string length, type and values:
print("len of S :", len(S))
print("type of S :", type(S).__name__)
print("val of elements 0 and 3 :", S[0], S[3])
print("type of elements 0 and 3 :", type(S[0]).__name__, type(S[3]).__name__)
... produce:
len of S : 13
type of S : str
val of elements 0 and 3 : W k
type of elements 0 and 3 : str str
We have already explored indexing and slicing rules in strings. (And these syntaxes also apply to arrays and other ordered collections.)
Capitalization matters in strings: uppercase and lowercase characters are not equal. Evaluating:
'A' == 'a'
... produces False. And Python knows the difference between a string '5' and an int 5, so that:
'5' == 5
... produces False, as well.
There is a special kind of string called the null string (or empty string), which has the property of having no characters, just an "open" quote followed immediately by the "close" one. As such, its length:
i_am_a_null_str = ''
print(len(i_am_a_null_str))
... is 0. This is an interesting case because bool('') evaluates to False, as if it were the string equivalent of 0 in the int type (hence the name "null string"). Any other string within the bool type conversion, such as bool('Hi'), will evaluate to True.
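As a quick check of the rule above, here is a small sketch that applies bool() to a few strings. The list of test strings and the variable names are just illustrative choices; note that a string holding only a space, or the character '0', is still non-empty.

```python
# Apply bool() to several strings; only the empty string should give False.
candidates = ['', ' ', '0', 'Hi']   # illustrative test strings
truth = {}
for s in candidates:
    truth[s] = bool(s)
    print(repr(s), "->", truth[s])
```

Only the first entry evaluates to False, because it is the only string with length 0.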
Q: How many lines are printed in the following?
s1 = '''Abc'''
if len(s1) < 5 :
    print("Am I printed?")
Q: How many lines are printed in the following?
s2 = ''
if s2 :
    print("Am I printed?")
However, while we can use indices to select string elements, we cannot simply reassign string elements, like we could with arrays. Thus, trying the following:
S[0] = 'T'
... leads to an error:
<ipython-input-349-d9a85c67a48e> in <module>
----> 1 S[0] = 'T'
TypeError: 'str' object does not support item assignment
In Python, strings are an immutable ordered collection, which literally means their elements cannot be reassigned while the rest of the object is unchanged. This is a different situation than with arrays, which are a mutable kind of collection. However, we will see below that we can effectively change string elements using some of the class's methods.
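One simple way to get the effect of changing a character, sketched here with slicing and concatenation (which were covered earlier), is to build a new string and leave the original untouched. The names word and new_word are hypothetical, chosen only for this illustration.

```python
word = "Walk"                 # illustrative string, not the S used above

# Direct item assignment fails, because strings are immutable.
try:
    word[0] = 'T'
except TypeError as err:
    print("Error:", err)

# Instead, build a new string from a new first char plus the rest of the old one.
new_word = 'T' + word[1:]
print(word)                   # the original is unchanged
print(new_word)
```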
19.2. Whitespace and special characters
As seen above when looking at the string S, the space character is just a normal string element. But it is also a member of a particular group of characters called whitespace, which also includes tabulation (tabs) and newlines. As the name implies, these characters are generally used to provide spaces or breaks between other characters. These breaks are useful for people visualizing text, data output, etc.; they provide spaces between words, vertical alignment and more. Here is a list of whitespace characters in Python:
Whitespace character | Name | Description |
---|---|---|
' ' | space | move cursor one space to the right |
\t | tab | tabulation moves the cursor to the next predefined "tabulation stop"; these stops are evenly spaced across a line |
\n | newline | move cursor to start of next line |
You might notice that multiple whitespace characters are written as \ with a following letter. This is a common syntax in programming, where the \ is called an escape character: it leads Python to alter the interpretation of one or more characters following it. Thus, abc\td has a different interpretation than abctd; we say that \t is an escape sequence, which is typically just the escape character and the character following it (though some escape sequences are longer). Having escape sequences basically expands the kinds of things that can be put into a string beyond just letters and numbers (most programming languages have such a syntax, and \ is often the escape character). Here are various whitespace examples in action:
print("Whitespace example with only spaces inserted")
print("Whitespace example with 2\t\ttabs inserted")
print("Whitespace example with a\nnewline char inserted")
... which evaluate to:
Whitespace example with only spaces inserted
Whitespace example with 2 tabs inserted
Whitespace example with a
newline char inserted
Note that \t and \n are each treated as a single character, and also that these don't need to be separated by spaces or anything to be recognized. You can see this by checking the length of the following strings:
print(len("abc d"))
print(len("abc\td"))
print(len("abc\nd"))
... which is 5 in each case. Also note that tabulation does not insert a set amount of whitespace. Each text editor has evenly spaced locations across a page, and the tabulation makes the cursor "jump" to the next one that appears to the right. This tab spacing varies across both text editor programs and computers. Try the following example on your system:
print("abc\tdefghijklm")
print("abcde\tfghijklm")
print("abcdefg\thijklm")
print("abcdefghi\tjklm")
print("abcdefghijk\tlm")
On our computer, this looks like:
abc defghijklm
abcde fghijklm
abcdefg hijklm
abcdefghi jklm
abcdefghijk lm
Note
Using tabulation can be a bit unstable if you run code on different systems, because the size of tabs can vary. Tabs can be a quick way to insert vertically-aligned space, but we will see some more broadly stable approaches, below, with string formatting.
Above, the escape character \ is used to make a "normal" character signify something else, such as \t to mean "tab" instead of the letter "t". However, the same syntax is also used to escape the behavior of a "special" character in order to make it "normal." One example of this being useful is to include a " character inside a string whose open/close quotes are also ", without needing to change the open/close quotes themselves:
print("The \" is my favorite character.")
... which successfully prints:
The " is my favorite character.
However, we might also accidentally escape a character's behavior that we did not want to modify. Consider the following, trying to display the \ itself:
print("The backslash looks like: \")
This produces a syntax error:
File "<ipython-input-350-536484bd8797>", line 1
print("The backslash looks like: \")
^
SyntaxError: EOL while scanning string literal
... where the Python interpreter points to a spot after the actual problem (the acronym EOL is for "end of line"). This happens because by default Python interprets the backslash as escaping whatever follows it, which in this case is the second quotation mark ": Python no longer recognizes that quote as closing the string, and keeps searching further in that line. When the EOL is reached before a close of the string, Python complains. To resolve this, one can actually use the escape character itself in order to escape the escape character's escaping behavior (read that twice!):
print("The backslash looks like: \\")
... which produces the appropriate output:
The backslash looks like: \
Escape successful! Note that writing \\ leads to only one backslash being printed, because the first one is on escaping duty. This \\ is one example of a non-whitespace escape sequence. Some other potentially useful ones are:
Escape sequence | Name | Description |
---|---|---|
\\ | backslash | display single backslash char (escape the escaper) |
\' (or \") | single (or double) quote | display single (or double) quote, and do not interpret it as a string start or finish |
{{ (or }}) | left (or right) curly bracket | display left (or right) curly bracket in a string used with format(); we will see below why |
\b | backspace | delete preceding char (putting several in a row will not remove other backspace chars, but will remove that number of other preceding chars) |
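To see a few of these sequences in action, here is a short sketch; the variable names are illustrative. It assumes the doubled-brace form applies to strings passed through the format() method, which is the standard behavior for that method. The backspace is left out because its visible effect depends on the terminal.

```python
# Each string below uses one escape sequence.
backslash = "A single backslash: \\"
dquote = "A double quote inside double quotes: \""
squote = 'A single quote inside single quotes: \''
# In a format() string, doubled braces give literal braces; {} is a placeholder.
braces = "Literal braces around a value: {{{}}}".format(42)

for s in [backslash, dquote, squote, braces]:
    print(s)
```

In each of the first three cases the escape sequence counts as one character, so for instance len("\\") is 1.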
19.3. String operators
The behavior of + and * with strings was discussed earlier.
Briefly, + concatenates two strings, and can be chained within a single expression:
print("Hello" + "And" + "Goodbye ")
... outputs: HelloAndGoodbye .
And * can operate on a string together with an int, producing a new string that is that integer number of copies of the original string concatenated:
print("Xyz" * 3)
... outputs: XyzXyzXyz.
Bafana Bafana and Banyana Banyana each drew their matches.
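A sentence like the one above can be assembled by combining both operators. The following is only one possible sketch of such an expression, with hypothetical variable names; it is not necessarily how the sentence was originally produced.

```python
# Repeat each name twice with *, then join the pieces with +.
men = "Bafana " * 2
women = "Banyana " * 2
sentence = men + "and " + women + "each drew their matches."
print(sentence)
```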
19.4. String methods (examples)
As we have seen in Python, each type generally comes with a host of useful methods and attributes. Here we just touch on the variety of string-specific functionality that is built into Python. Each of the string methods will output new objects, rather than operating in-place (because the string type is immutable), but it is still a good habit to double check the help when first becoming acquainted with a method.
Note that Python itself typically doesn't distinguish between individual characters, whether letter, number, whitespace or another symbol. We can see this in some of the string methods. However, because humans often do, there are several string methods that can help to identify words and other "lexical" properties.
You can find out more comprehensively about the available string methods with help(str), of which we just highlight some useful ones here:
capitalize
| capitalize(self, /)
| Return a capitalized version of the string.
|
| More specifically, make the first character have upper case and the rest lower
| case.
|
This method takes no arguments (recall: the self means that the object it is attached to is essentially an input, but we don't name it again when using the method); it just capitalizes the first character in a string (and makes the rest lowercase):
print('antananarivo'.capitalize())
print("MADAGASCAR".capitalize())
print("tsingy de BEMARAHA".capitalize())
print(" zebu".capitalize())
... produces:
Antananarivo
Madagascar
Tsingy de bemaraha
zebu
You can see that these results match the description: Python doesn't know about whitespace-separated words within the string to capitalize. And if the first character is whitespace, it doesn't try to find the first non-whitespace character.
count
| count(...)
| S.count(sub[, start[, end]]) -> int
|
| Return the number of non-overlapping occurrences of substring sub in
| string S[start:end]. Optional arguments start and end are
| interpreted as in slice notation.
|
This method takes one required argument, sub, which is the "sub"string to search for within the main string. It can then take up to two optional args, to control the starting and ending points of the search:
place = '''Ouagadougou, Burkina Faso'''
place.count('ou')
... outputs: 2 (noting again that Python distinguishes between uppercase and lowercase). It is easy to forget the string-quotes on ou, which would lead to the following error:
----> 2 place.count(ou)
NameError: name 'ou' is not defined
... as Python tried to interpret ou as a variable name (which would be fine if you had defined it as one previously).
Using the options, one has:
print(place.count('ou', 7)) # search from index [7] to the end
print(place.count('ou', 7, 8)) # search within indices [7, 8)
... which produces 1 and 0 respectively (the second case defines a preeetttty narrow subset of the initial string to search within).
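The docstring's phrase "non-overlapping occurrences" is worth a quick check. In this small sketch (the test strings are illustrative), once a match is counted, the search resumes after the end of that match, so overlapping candidates are skipped.

```python
# 'aaaa' contains 'aa' at indices 0, 1 and 2, but only 2 of these don't overlap.
n_pairs = "aaaa".count("aa")
print(n_pairs)

# 'banana' contains 'ana' at indices 1 and 3, which overlap; only one is counted.
n_ana = "banana".count("ana")
print(n_ana)
```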
find
| find(...)
| S.find(sub[, start[, end]]) -> int
|
| Return the lowest index in S where substring sub is found,
| such that sub is contained within S[start:end]. Optional
| arguments start and end are interpreted as in slice notation.
|
| Return -1 on failure.
|
This method works quite similarly to the count() method, above, but instead of returning how many times sub appears, it returns the "lowest index in S where substring sub is found": that is, the first place in the string that sub is found.
We note that one has to be a little careful about how we interpret the output, because Python recognizes -1 as a valid index. Consider using our variable from above; we could use find() and print the character at the returned index as follows, to verify that the method works:
ii = place.find('B')
print(place[ii])
And this outputs B, so that is fine. But what happens if we try searching for a different letter, which isn't there? Well, the output of the following:
ii = place.find('Z')
print(place[ii])
... is o, so it appears that Python is wrong (!). However, if we print the index, as well, we can see what is happening:
ii = place.find('Z')
print(ii)
print(place[ii])
... because now we recognize that the index is -1, which the docstring tells us to interpret as "not found", and we see that the o appears because place[-1] picks out the last character.
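A simple way to avoid this trap is to test the returned index against -1 before using it. The following is a minimal sketch of that check, reusing the place string from above; the loop variable name is illustrative.

```python
place = 'Ouagadougou, Burkina Faso'

for letter in ['B', 'Z']:
    ii = place.find(letter)
    if ii == -1:
        # -1 means "not found", so do NOT use it as an index here
        print("'{}' was not found".format(letter))
    else:
        print("'{}' first appears at index {}".format(letter, ii))
```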
split
| split(self, /, sep=None, maxsplit=-1)
| Return a list of the words in the string, using sep as the delimiter string.
|
| sep
| The delimiter according which to split the string.
| None (the default value) means split according to any whitespace,
| and discard empty strings from the result.
| maxsplit
| Maximum number of splits to do.
| -1 (the default value) means no limit.
|
This is a really useful method for helping us to identify words in a string that contains a line of text. This method will break up a string into a list (another ordered collection we will discuss shortly) of strings, with the split happening at a particular "separator" sep; the default sep(arator) is any whitespace. So, consider the following:
sentence = '''Yamoussoukro:\t the political capital of Cote d'Ivoire.'''
print(sentence)
print(sentence.split())
... whose output is:
Yamoussoukro: the political capital of Cote d'Ivoire.
['Yamoussoukro:', 'the', 'political', 'capital', 'of', 'Cote', "d'Ivoire."]
Note that in each new string in the list, there is no whitespace. We could choose to split the string at, say, only spaces or, say, at 'it':
print(sentence.split(' ')) # specify sep as arg by position
print(sentence.split(sep='it')) # specify sep as kwarg
... which outputs:
['Yamoussoukro:\t', 'the', 'political', 'capital', 'of', 'Cote', "d'Ivoire."]
['Yamoussoukro:\t the pol', 'ical cap', "al of Cote d'Ivoire."]
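The docstring also lists a maxsplit option, which the examples above do not use. As a small sketch (the names head and rest are illustrative), limiting the method to one split separates the first word from the remainder of the line:

```python
sentence = '''Yamoussoukro:\t the political capital of Cote d'Ivoire.'''

# Split at most once, at the first run of whitespace.
pieces = sentence.split(maxsplit=1)
head = pieces[0]
rest = pieces[1]
print(head)
print(rest)
```

With the default separator, the whitespace between the two pieces is discarded, but whitespace inside the remainder is kept.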
Note
There are many more methods for strings, and it is worth exploring them. The above gives a sampling, and some demonstrations of translating the docstring sections for each. The Practice Problems below include more, as well.
19.5. String formatting: Place data into strings
Often we want to display results of our work at several points in a program: in the middle of calculations, to check the code's progress and/or to inspect intermediate results; and at the end, to display the final results. Additionally, we might want to make labels for plotting, or save data to an output file for further use. To do this, we have to fully understand what we are outputting (the type, length/shape of object, etc.). Then decisions have to be made, such as how many decimal places to display, how to include text describing what each number is, how to make aligned columns of results, etc. This all comes under the category of string formatting: inserting the quantities of interest into strings and specifying what display properties they should have.
19.5.1. Basic format method
Up until this point, we have used print() to display strings and other types very simply (as described earlier), by separating each item with commas:
print("Finished.")
print("x =", x)
print("Avec =", A, "and Bvec =", B)
etc. We now look at more interesting ways to insert data into strings and format the results.
There are different methods and styles for performing this kind of operation in Python, but we will primarily use the modern, built-in string format() method. From scrolling down/searching the help(str) docstring, we see that this method takes both positional and keyword arguments:
...
format(...)
S.format(*args, **kwargs) -> str
Return a formatted version of S, using substitutions from args and kwargs.
The substitutions are identified by braces ('{' and '}').
...
The basic approach for this is to write a string with {} positioned as a placeholder each time we will want to insert a value somewhere, and then we provide the values themselves as arguments. The values can be either variables or expressions to be evaluated. For example:
import numpy as np

x = 5
print("x = {}".format(x))

y = -15.5
print("If y = {}, then five minus it = {}".format(y, 5-y))

z = "Xhosa"
print("The first two letters of '{}' are {}.".format(z, z[:2]))

Avec = np.arange(-2, 2)
Bvec = np.ones(2, dtype=bool)
print("\nAvec = {} and Bvec = {}".format(Avec, Bvec))
produces:
x = 5
If y = -15.5, then five minus it = 20.5
The first two letters of 'Xhosa' are Xh.
Avec = [-2 -1 0 1] and Bvec = [ True True]
In this usage, if we have N values to insert, we will reserve N spaces in the string with curly brackets {} and also have N comma-separated values within the format(...) method. And note from the example of printing arrays, the above works well for short arrays, but for longer ones, we might want to loop over the elements and display their indices:
for idx in range( len(Avec) ):
    print("[{}]th value is : {}".format(idx, Avec[idx]))
... produces:
[0]th value is : -2
[1]th value is : -1
[2]th value is : 0
[3]th value is : 1
This provides a nicer way to view large arrays, lists or other ordered collections. In the next section we will see more options to format this even more precisely, such as controlling vertical alignment, spacing and more.
Note
Here and in discussion below, we generally print the strings that are being formatted. This is just because we want to quickly display the results. However, in each case, we could alternatively save the results to a variable with assignment, such as:
z = "Xhosa"
var = "The first two letters of '{}' are {}.".format(z, z[:2])
... and use it further in some way. We mention this so there is no misapprehension that string formatting only occurs inside print() -- instead, it is quite general and useful.
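The docstring quoted earlier says that format() accepts both positional and keyword arguments. As a small sketch of what that allows (the variable and keyword names here are hypothetical), placeholders can refer to arguments by position number or by keyword name, which lets a value be reused or reordered:

```python
city = "Maputo"
country = "Mozambique"

# Refer to positional arguments by index: {0} is the first, {1} the second.
by_index = "{0} is in {1}; {1} contains {0}.".format(city, country)
print(by_index)

# Refer to keyword arguments by name.
by_name = "{c} is in {n}.".format(c=city, n=country)
print(by_name)
```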
19.5.2. Format specifiers: string type
Beyond just displaying a variable, we can control several aspects of spacing, alignment and decimal output (where appropriate) using format specifiers. These are placed within each pair of curly brackets whose contents you want to format. The specifier starts with a colon :, can then contain some options, and typically ends with the type. We first look at options for the string type, which is denoted by ending the format specifier with an s.
One of the most common formatting options for strings is to create a window of a certain width (in terms of number of characters), into which the string value is inserted, and the next characters appear after that window. If the string length is greater than the window's width, then it is effectively as if the window weren't specified. Additionally, one can specify whether to use left (<), right (>) or center (^) justification for the string within that window. Consider the following:
ss = 'abc'
print("0123456789|") # just making a reference "count" of spaces
print("{:10s}|".format(ss)) # window = 10 spaces, str = def loc (left)
print("{:<10s}|".format(ss)) # window = 10 spaces, str = left loc
print("{:^10s}|".format(ss)) # window = 10 spaces, str = center loc
print("{:>10s}|".format(ss)) # window = 10 spaces, str = right loc
... which produces:
0123456789|
abc       |
abc       |
   abc    |
       abc|
The vertical | shows where the edge of the specified 10 char window is in each case. Note how the center justification is "as central as possible", depending on the relative even/oddness of the window and inserted string.
When specifying justification, you can also choose to fill the spaces in the window with a particular character, such as in the following:
print("{:.<10s}|".format(ss))
print("{:-^10s}|".format(ss))
print("{:z>10s}|".format(ss))
... which produces:
abc.......|
---abc----|
zzzzzzzabc|
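Windows and justification become most useful when several strings are placed on each line. As an illustrative sketch (the city/country data and variable names are made up for this example), a left-justified column followed by a right-justified one lines up neatly across rows:

```python
rows = [("Accra", "Ghana"), ("Lome", "Togo"), ("Windhoek", "Namibia")]

lines = []
for city, country in rows:
    # 10-char window for each value: city on the left, country on the right
    lines.append("{:<10s}|{:>10s}|".format(city, country))

for line in lines:
    print(line)
```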
19.5.3. Format specifiers: numerical type
Python has a lot of numerical types, so perhaps we shouldn't be surprised that there are also several numerical specifier types (and here we only consider scalar/non-collection ones). Some of the more common ones include:
Type | Description |
---|---|
b | binary |
d | integer |
e, E | scientific notation float, with "e" or "E", respectively |
f | "fixed point" float (i.e., standard decimal notation) |
g, G | "general" format: if the value is within a couple magnitudes of 0, display as "fixed point"; else, using scientific notation |
% | convert to percent (multiplies by 100) and display with % at end |
Note
Mixing specifier and variable types between string and numerical types (e.g., trying to put an int into a :s specifier, or a str into a :f specifier) produces an error.
Within numerical types, trying to put a float into :d fails, but putting an int into :f is allowed.
So:
dd = 45
print("{:d}".format(dd))
print("{:e}".format(dd))
print("{:E}".format(dd))
print("{:f}".format(dd))
print("{:%}".format(dd))
... produces:
45
4.500000e+01
4.500000E+01
45.000000
4500.000000%
Python appears to default to a precision of 6 decimal places across all relevant specifier types. This can be controlled by using .NUMBER in the specifier as follows:
print("PI is approx: {}".format(np.pi)) # def with no specifier
print("PI is approx: {:f}".format(np.pi)) # def for fixed point spec
print("PI is approx: {:.2f}".format(np.pi)) # 2 decimals
print("PI is approx: {:.25f}".format(np.pi)) # 25 decimals
Note that the displayed output is rounded, not just truncated:
PI is approx: 3.141592653589793
PI is approx: 3.141593
PI is approx: 3.14
PI is approx: 3.1415926535897931159979635
The window and justification syntax from the above string formatting applies equivalently to numerical types, and it can be combined with the precision specification, too:
ff = 0.123
print("{:10f}|".format(ff))
print("{:<10f}|".format(ff))
print("{:^10.3f}|".format(ff))
print("{:>10.0f}|".format(ff))
... produces:
  0.123000|
0.123000  |
  0.123   |
         0|
Note that for numerical types, values are displayed with right justification by default (instead of left, as for strings).
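The Note earlier in this section, about mixing specifier and value types, can be checked directly. This is a small sketch with illustrative values; it relies on the fact that these mismatches raise a ValueError, which is caught here so that all cases can run in one loop.

```python
# (specifier, value) pairs: the first three mix types in ways that should fail
attempts = [("{:s}", 5), ("{:f}", "five"), ("{:d}", 5.0), ("{:f}", 5)]

results = []
for spec, value in attempts:
    try:
        out = spec.format(value)
        print(spec, "with", repr(value), "-->", out)
    except ValueError as err:
        out = None                      # mark that formatting failed
        print(spec, "with", repr(value), "--> ValueError:", err)
    results.append(out)
```

Only the last pair, an int placed into :f, produces a formatted string.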
19.5.4. Format specifiers: practical example
The above is fun, sure, but why would we need to spend all this time thinking about windows for variables and numbers of decimal places?
Well, when we want to display data, vertical alignment and horizontal displacement can make a big difference in being able to quickly (and accurately) assess output. Consider the following example, where we display an array:
import numpy as np

C = np.array([-18.5, 300.1234, 0.1, 99.9999999, 25, 16, 91, 0.032, -14.4, 0, 1, 56])
N = len(C)
for i in range(N):
    print("val [{}] --> {}".format(i, C[i]))
This produces:
val [0] --> -18.5
val [1] --> 300.1234
val [2] --> 0.1
val [3] --> 99.9999999
val [4] --> 25.0
val [5] --> 16.0
val [6] --> 91.0
val [7] --> 0.032
val [8] --> -14.4
val [9] --> 0.0
val [10] --> 1.0
val [11] --> 56.0
Because the array elements have no vertical alignment, it's hard to tell where the largest or smallest value might be: just because a number has a lot of decimal places doesn't mean it's large. Additionally, since there are more than 10 elements, the indices change from being single to double digit, which also makes it difficult to check the numbers. So, let's try to fix this with format specifiers. There is a lot of freedom here, but consider the following:
for i in range(N):
    print("val [{:2}] --> {:15.8f}".format(i, C[i]))
... which produces:
val [ 0] -->    -18.50000000
val [ 1] -->    300.12340000
val [ 2] -->      0.10000000
val [ 3] -->     99.99999990
val [ 4] -->     25.00000000
val [ 5] -->     16.00000000
val [ 6] -->     91.00000000
val [ 7] -->      0.03200000
val [ 8] -->    -14.40000000
val [ 9] -->      0.00000000
val [10] -->      1.00000000
val [11] -->     56.00000000
This looks much easier to assess! Again, details will matter a lot: we kind of have to know the largest index value to know how many spaces it might take up; we have to know the type and relevant number of decimal places in the array; etc. However, once we know those things, we can really make nice visual displays with just a couple parameters.
19.5.5. Format specifiers: nesting (bonus)
As a small, extra (but fun!) point about format specifiers: the specifier options themselves can be inserted, via nested specifiers. That is, consider the following:
print("The values are: |{:{}.{}f}| |{:{}.{}f}|".format(1.234567, 15, 9, 9.87, 10, 5))
... which produces:
The values are: |    1.234567000| |   9.87000|
What is happening here? Well, by counting the curly braces, there are 6 values to insert. How do we know the order in which values will be placed? Well, start from left to right; when we get a curly bracket, that will get the first value. We then check for any inner curly brackets, and if there are some, go left-right filling in with the next values. When done, continue through the string for any more "outer" curly brackets, and repeat. Thus, the above is equivalent to:
print("The values are: |{:15.9f}| |{:10.5f}|".format(1.234567, 9.87))
(You can verify this more precisely by setting the string currently in each print() to a variable, and using == to check strict equality.)
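One practical use of nesting is to compute a window width from the data, rather than hard-coding it. The sketch below revisits the values from the practical example, stored here in a plain Python list so the block is self-contained (an assumption made for this illustration); idx_width is a hypothetical name for the number of characters in the largest index.

```python
C = [-18.5, 300.1234, 0.1, 99.9999999, 25.0, 16.0,
     91.0, 0.032, -14.4, 0.0, 1.0, 56.0]
N = len(C)

# Number of characters needed to display the largest index, N-1
idx_width = len(str(N - 1))

lines = []
for i in range(N):
    # Order of values: outer {} gets i, its inner {} gets idx_width, then C[i]
    lines.append("val [{:{}d}] --> {:15.8f}".format(i, idx_width, C[i]))

for line in lines:
    print(line)
```

If the collection grew beyond 100 elements, the index column would widen automatically, with no change to the format string.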
19.6. Practice
- Print the following string:
  I don't really care for ", ', ''' or """. But \ is OK.
- Use the str method replace to:
  - put your own name in the place of who? in this string: My name is who?.
  - swap old and new in the following string: What's old is now new!.
- Use the str method strip to:
  - trim whitespace from around (but not within) the sentence in this str: \tHere is the sentence, by itself. \n
  - remove the dashes from ------this string---------
  - leave only the word center in: nnnnncenternnnnn
- What is the name of the related method to strip whitespace (or other characters) from only the right side of the string?
- Write a for-loop to see how many orders of magnitude from zero a number has to be before the g format specifier changes from displaying a fixed point float to using scientific notation. Investigate using powers of 10, with both negative and positive exponents.
- Can you draw the following picture using only one print statement?
format specifier changes from displaying a fixed point float to using scientific notation. Investigate using powers of 10, with both negative and positive exponentials.Can you draw the following picture using only one print statement?