21. Be careful: copying arrays, lists (and more)

This is an important note for working with arrays, lists and some other collection types in Python. It is so important, we have given it its own section so it won't get missed (glad you are here now!). The basic message is: be careful when copying objects in Python, be sure you are actually making a copy.

We demonstrate using arrays, so let's import numpy:

import numpy as np

21.1. The problem

Let's say you made the following array:

A = np.array([3, 4, 5, 6, 7])

... and then you wanted to copy it and change one of the elements in the new copy. So you do the following:

B    = A                            # copy A to B
B[2] = -100                         # change an element of B
print(B)

... and this outputs what we would expect and want for B:

[   3    4 -100    6    7]

Fine, we just made B from A, and then changed the value at the location we wanted. Everything seems great.

... until we check back on the initial array A. You might ask, why would we do this, since we weren't changing that array, right? Well, let's just print(A) and see what happens. And we get:

[   3    4 -100    6    7]

Huh!?!! That is not a copy+paste error. A itself has somehow changed, even though we only re-assigned an element of B!

And what happens if we just want to copy part of the array? Let's find out:

A    = np.array([3, 4, 5, 6, 7])    # initialize array
B    = A[1:]                        # "copy"/assign a SUBSET of values to B
B[1] = -200                         # change one element of B...

print(A)
print(B)

... also shows the same issue:

[   3    4 -200    6    7]
[   4 -200    6    7]

Let's explore why this occurs, and how we can properly copy.

21.2. Why is this happening?

We were aiming to copy array A, in order to make a whole new array B that had the same values at each element. We tried to do this with the assignment operator: B = A. The problem actually originates with how we are thinking of A compared to what the Python interpreter thinks of A, and what assignment of an array actually means. Let's investigate.

We created an array of 5 values, and then assigned that object to a variable called A.

  • Our (possible) perception: First we create an array A and assign it values; so somewhere in memory are those 5 values, and that vector-like thing of them is A. If we type A, Python gets all the array values into memory and prepares to work with them. And B = A means: get the full set of values in A, put them somewhere currently unused in memory, and store that new array under the name B. After this, A and B are separate arrays, so we have the same values in two separate locations---copy complete!

  • Python's perception: First we create an array A and assign it values; so somewhere in memory are those 5 values, and the location of their starting spot has some "address". Assign the label A to that address, so that when the human refers to it, the interpreter goes to that address where those values are stored (it doesn't just pick up all the values of A). And B = A means: get the address of A, and assign that address to a new label, called B, but the original label to the same address still exists. After this, we still have only one array of values, and both A and B each refer to it. Thus, the same array of values can now be selected or changed by referring to either name---a situation ripe for confusion!

In the battle of perceptions here, Python's wins. We might think of A as the whole list of values, but to Python A just points to the start of the values in memory (to what is called the object's reference). What we thought was copying the array was instead assigning a new label to the same address.

Note that this is a purely computational situation. How Python handles the array object internally is what is at issue---well, that, along with our understanding of it.

21.3. So, how do we copy an array?

There are a couple of ways to copy an array. For example, we could initialize a new array, then loop over the old one, and provide it new values element by element. It works, but that seems like a lot of code just to copy an array.

Probably the safest and most efficient way to do this is to use Python's "copy" module. One of the functions there is called deepcopy(), and this will be our function of choice for copying objects that have this issue. Let's try using this to redo our example from above (with some newly-named arrays!):

import copy

C    = np.array([3, 4, 5, 6, 7])
D    = copy.deepcopy(C)
D[2] = -100

print("C:", C)
print("D:", D)

... produces the kind of copying behavior we had wanted all along:

C: [3 4 5 6 7]
D: [   3    4 -100    6    7]

21.4. What object types does this affect?

You might be saying to yourself: Hey, wait a minute! If we take a float and try using simple assignment to copy it:

x = 1.0
y = x
y = -5.0

print("x:", x)
print("y:", y)

... we actually do seem to copy it OK---the output is:

x: 1.0
y: -5.0

So, when exactly do we have to worry about copying and object and use copy.deepcopy?

That's an excellent question. And indeed, this behavior happens to a particular set of Python objects. Specifically, the mutable objects in Python: those that are changeable (= mutable) once they have been made. The mutable object umbrella includes:

  • arrays

  • lists

  • dictionaries (which we have not discussed yet)

  • sets (which we have not discussed yet)

Each of the above types is a collection whose elements we can change with re-assignment, like A[0] = -1.

In contrast, immutable objects are those that cannot be changed once created. These include:

  • strings

  • tuples (which we have not discussed yet)

  • numeric types: int, float, complex

  • bool

Recall how we could not reassign elements in a string---the following produces an error:

some_str      = 'But I am constant as the northern star.'
some_str[-14] = 'N'

Instead, we can use string methods (e.g., replace) to create a new string with altered values, and reassign that to the original string's variable name. But the reason we have to use such a roundabout method of replacing a character is because of the underlying string property of being immutable.

We might have thought that floats are mutable, because if we write:

var = 4.0
var = 5.0

... haven't we just shown it is mutable? Well, actually, no. Python is really creating a new value somewhere in memory and re-assigning the label, rather than keeping the same label and changing the value there. So, indeed, these rules are consistent, even if we wouldn't necessarily have thought of it that way.

Note

There is a built-in function in Python called id that returns the numerical "identity" of an object; something like a unique identifier when a new object is created. We can use this to verify that the above float really does get created anew, rather than staying the same and having its value changed:

var = 4.0
print("id =", id(var))
var = 5.0
print("id =", id(var))

When we ran this on our computer it produced:

id = 140283284338128
id = 140283275379280

... which indeed are not equal (and the exact numbers will likely differ on your computer, or even on ours if we re-ran it).

Q: Use the id function to show that an array is mutable. That is, initialize an array, then change an element's, and show that the identity of the overall array has not changed.

+ show/hide code

Q: Use the id function to show that the "naive" copying of an array really doesn't create a separate object. (Just in case you didn't believe us, up above.)

+ show/hide code

21.5. Comment: deep vs shallow copying

If the copying function of choice is called deepcopy, does that imply there is such a thing as a "shallow copy"? Actually, yes it does. What is the difference between a shallow copy and a deep copy?

We have already seen nested lists: having a list as an element with a list. That creates a hierarchical structure of collections. Why does this matter? Well, a "shallow copy" will only actually copy the values in the top level list (make a new id value for the outermost list), but the other lists that are "deeper" in the hierarchical structure are not guaranteed to be copied---those might still refer to the same locations for the inner layers of the original list.

So, a shallow copy might be safe for applying to a 1D object, but not to higher dimensional objects. A "deep copy" really copies everything throughout the object, at all layers.

Note

There are some additional subtleties to this. The behavior with shallow copying might be affected by the dtypes of the inner elements. Also, arrays and lists can behave a little differently (e.g., using index selection with lists seems to at least provide a shallow copy, which did not occur with arrays in the top section). However, copy.deepcopy should always be the most reliable option.

As noted above, arrays are similar to lists in some ways, and this is one of them. Briefly, a 2D array in programming is often best thought of as an "array of arrays", rather than just a matrix of numbers. The former description is a more accurate picture of how most computers deal with arrays (extending to higher dimensionality, too). We might picture a 2D array as having a flat structure of elements, just sitting there in a square or rectangle. But internally, it is really structured in a hierarchy: a top-level 1D array, each of whose elements is another 1D array. That is, a 2D (or higher dimensional) array is also a nested structure of collections. Therefore, the same issues of shallow vs deep copying apply, as for lists.

There is a copy function in numpy called np.copy, as well as a method copy. One would think that it would provide safe copying, but in fact it only outputs shallow copies. This function's own help recommends using copy.deepcopy:

 1Note that np.copy is a shallow copy and will not copy object
 2elements within arrays. This is mainly important for arrays
 3containing Python objects. The new array will contain the
 4same object which may lead to surprises if that object can
 5be modified (is mutable):
 6
 7>>> a = np.array([1, 'm', [2, 3, 4]], dtype=object)
 8>>> b = np.copy(a)
 9>>> b[2][0] = 10
10>>> a
11array([1, 'm', list([10, 3, 4])], dtype=object)
12
13To ensure all elements within an ``object`` array are copied,
14use `copy.deepcopy`:
15
16>>> import copy
17>>> a = np.array([1, 'm', [2, 3, 4]], dtype=object)
18>>> c = copy.deepcopy(a)
19>>> c[2][0] = 10
20>>> c
21array([1, 'm', list([10, 3, 4])], dtype=object)
22>>> a
23array([1, 'm', list([2, 3, 4])], dtype=object)

21.6. In summary ...

Be careful when you are copying an array, list or any other mutable type object. We cannot do the following to copy, in the sense of having two distinct objects to manipulate separately:

A = <some array, list or other mutable object>
B = A

In particular, using the copy.deepcopy method is generally safe for these kinds of objects:

import copy

A = <some array, list or other mutable object>
B = copy.deepcopy( A )

In our own thought-processing and algorithm generation, we might picture the variable A as representing the whole array/list/etc.---that is fine. But when it comes to implementing the program, we need to be aware that it is something different to Python (it is just the starting point reference of the object), so that we can deal with it appropriately. We need to be aware of this computational aspect to avoid buggy or incorrect behavior.

If you are unsure, verify what you are doing with an example or a print function.