PyDays 2017

PyDays 2017 was Austria's first conference dedicated to the Python programming language. It took place on May 5 and May 6 and was graciously hosted by the Linuxwochen Wien at FH Technikum in Vienna. It was great on many levels: meeting new people excited about Python and where it's headed, over 20 talks, interesting hallway conversations, etc.

I helped out with the organization of the conference, mostly taking care of catering. We (the organization team) are very happy with how everything went. The atmosphere was welcoming and open, the quality of the talks and workshops was very good, attendance was great, and people seemed to have a really good time. From an organizational perspective, the talks mostly stayed on time, the audio/video hardware worked fine, and the PyDays booth was well-received. Handing out snacks, coffee, soda, and cake during the breaks worked out nicely, too. The feedback, both at and after the conference, was very positive.

In addition to co-organizing the conference, I also gave a talk about Dask, a Python library for parallel computing. It provides data structures similar to NumPy arrays or Pandas DataFrames that process operations in parallel and scale to out-of-memory datasets. Dask is exciting because it provides the analytical power and familiar mental model of DataFrames, yet handles hundreds of gigabytes of data and scales seamlessly from a single laptop to a computing cluster. The talk was well-received, and there was a lot of interest at the end and in the hallway afterwards. The slides are online if you're interested.
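
To give a rough idea of what that looks like in practice, here is a minimal sketch (the file pattern and column names are made up for illustration):

import dask.dataframe as dd

# Looks and feels like Pandas, but reads the CSV files lazily and in parallel.
df = dd.read_csv('measurements-*.csv')

# Operations build up a task graph; compute() executes it, using all cores
# (or a cluster, if one is configured).
mean_per_station = df.groupby('station').temperature.mean().compute()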

Finally, I'd like to thank my co-organizers, Claus, Kay, Helmut and Sebastian, as well as the people who helped out during the conference, for all their great work to make the PyDays a success. I'd also like to thank our sponsors: the Python Software Foundation, the Python Software Verband e.V., T-Mobile, UBIMET, and Jetbrains.

Overall it was great to see so much community interest in Python. If you're passionate about Python as well, join us on the Python Austria Slack and our meetups!


Working with angles (is surprisingly hard)

Due to the periodic nature of angles, especially the discontinuity at 2π / 360°, working with them presents some subtle issues. For example, the angles 5° and 355° are "close" to each other, but are numerically quite different. In a geographic (GIS) context, this often surfaces when working with geometries spanning Earth's date-line at -180°/+180° longitude, which often requires multiple code paths to obtain the desired result.

I've recently run into the problem of computing averages and differences of angles. The difference should be 0 if the two angles are equal, negative if the second angle lies to one side of the first, and positive if it lies on the other side (whether that's left or right depends on whether angles increase in the clockwise or the counter-clockwise direction), and it should grow linearly with the separation between the angles. Getting that right, and coming up with test cases to prove that it's correct, was quite interesting. As Bishop notes in Pattern Recognition and Machine Learning, it's often simpler to transform angles into a 2D (x, y) space, perform the operations there, and transform back to angles using the atan2() function. I've used that approach for the averaging function; the difference is calculated using modulo arithmetic.

Here's the Python version of the two functions:

import math

def average_angles(angles):
    """Average (mean) of angles

    Return the average of an input sequence of angles. The result is between
    ``0`` and ``2 * math.pi``.
    If the average is not defined (e.g. ``average_angles([0, math.pi])``),
    a ``ValueError`` is raised.
    """

    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)

    if x == 0 and y == 0:
        raise ValueError(
            "The angle average of the inputs is undefined: %r" % angles)

    # To get outputs from -pi to +pi, delete everything but math.atan2() here.
    return math.fmod(math.atan2(y, x) + 2 * math.pi, 2 * math.pi)


def subtract_angles(lhs, rhs):
    """Return the signed difference between angles lhs and rhs

    Return ``(lhs - rhs)``, the value will be within ``[-math.pi, math.pi)``.
    Both ``lhs`` and ``rhs`` may either be zero-based (within
    ``[0, 2*math.pi]``), or ``-pi``-based (within ``[-math.pi, math.pi]``).
    """

    return math.fmod((lhs - rhs) + math.pi * 3, 2 * math.pi) - math.pi
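
A few quick sanity checks illustrate the intended behaviour (math.isclose() requires Python 3.5+):

# The average of 0 and pi/2 is pi/4.
assert math.isclose(average_angles([0, math.pi / 2]), math.pi / 4)

# 5 deg and 355 deg are only 10 deg apart, despite the large numeric gap.
assert math.isclose(subtract_angles(math.radians(5), math.radians(355)),
                    math.radians(10))

# The difference is signed: going the other way around yields -10 deg.
assert math.isclose(subtract_angles(math.radians(355), math.radians(5)),
                    math.radians(-10))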

The code, along with test cases, can also be found in this GitHub Gist. Translating these functions to other languages should be straightforward, since sin()/cos()/fmod()/atan2() are pretty ubiquitous.


Resource management with Python

There should be one – and preferably only one – obvious way to do it.

— Tim Peters in The Zen of Python

There are multiple ways to manage resources with Python, but only one of them is safe, reliable, and Pythonic.

Before we dive in, let's examine what resources can mean in this context. The most obvious examples are open files, but the concept is broader: it includes locked mutexes, started client processes, or a temporary directory change using os.chdir(). The common theme is that all of these require some sort of cleanup that must reliably be executed in the future. The file must be closed, the mutex unlocked, the process terminated, and the current directory must be changed back.

So the core question is: how to ensure that this cleanup really happens?

Failed solutions

Manually calling the cleanup function at the end of a code block is the most obvious solution:

f = open('file.txt', 'w')
do_something(f)
f.close()

The problem with this is that f.close() will never be executed if do_something(f) raises an exception. So we need a better solution.

C++ programmers see this and try to apply the C++ solution: RAII, where resources are acquired in an object's constructor and released in the destructor:

class MyFile(object):
    def __init__(self, fname):
        self.f = open(fname, 'w')

    def __del__(self):
        self.f.close()

my_f = MyFile('file.txt')
do_something(my_f.f)
# my_f.__del__() automatically called once my_f goes out of scope

Apart from being verbose and a bit un-Pythonic, it's also not necessarily correct. __del__() is only called once the object's refcount reaches zero, which can be prevented by reference cycles or leaked references. Additionally, until Python 3.4 some __del__() methods were not called during interpreter shutdown.

A workable solution

The way to ensure that cleanup code is called in the face of exceptions is the try ... finally construct:

f = open('file.txt', 'w')
try:
    do_something(f)
finally:
    f.close()

In contrast to the previous two solutions, this ensures that the file is closed no matter what (short of an interpreter crash). It's a bit unwieldy, especially when you think about try ... finally statements sprinkled all over a large code base. Fortunately, Python provides a better way.

The correct solution™

The Pythonic solution is to use the with statement:

with open('file.txt', 'w') as f:
    do_something(f)

It is concise and correct even if do_something(f) raises an exception. Nearly all built-in classes that manage resources can be used in this way.
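
The same pattern works for many other resource types in the standard library, for example locks and temporary directories (a quick illustration; tempfile.TemporaryDirectory is Python 3 only):

import tempfile
import threading

lock = threading.Lock()

# The lock is released when the block exits, even if an exception is raised.
with lock:
    shared_counter = 42  # stand-in for work on shared state

# The directory and its contents are removed at the end of the block.
with tempfile.TemporaryDirectory() as tmpdir:
    with open(tmpdir + '/scratch.txt', 'w') as f:
        f.write('intermediate data')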

Under the covers, this functionality is implemented using objects known as context managers, which provide __enter__() and __exit__() methods that are called at the beginning and end of the with block.
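
Written out by hand, such a class might look like this (a minimal sketch, using the same placeholder acquire_resource()/release_resource() functions as in the examples below):

class ManagedResource(object):
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        # Called when entering the with block; the return value is bound by 'as'.
        self.resource = acquire_resource(self.name)
        return self.resource

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the block exits, whether normally or via an exception.
        release_resource(self.resource)

While it's possible to write such classes manually, an easier way is to use the contextlib.contextmanager decorator: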

from contextlib import contextmanager

@contextmanager
def managed_resource(name):
    r = acquire_resource(name)
    try:
        yield r
    finally:
        release_resource(r)

with managed_resource('file.txt') as r:
    do_something(r)

The contextmanager decorator turns a generator function (a function with a yield statement) into a context manager. This way it is possible to make arbitrary code compatible with the with statement in just a few lines of Python.

Note that try ... finally is used as a building block here. In contrast to the previous solution, it is hidden away in a utility resource manager function, and doesn't clutter the main program flow, which is nice.

If the client code doesn't need to obtain an explicit reference to the resource, things are even simpler:

@contextmanager
def managed_resource(name):
    r = acquire_resource(name)
    try:
        yield
    finally:
        release_resource(r)

with managed_resource('file.txt'):
    do_something()

Sometimes the argument comes up that this makes it harder to use those resources in interactive Python sessions – you can't wrap your whole session in a gigantic with block, after all. The solution is simple: just call __enter__() on the context manager manually to obtain the resource:

cm_r = managed_resource('file.txt')
r = cm_r.__enter__()
# Work with r...
cm_r.__exit__(None, None, None)

The __exit__() method takes three arguments; passing None here is fine (they are used to pass exception information, where applicable). Another option in interactive sessions is to not call __exit__() at all, if you can live with the consequences.

Wrap Up

Concise, correct, Pythonic. There is no reason to ever manage resources any other way in Python. If you aren't using the with statement yet, start now!


libconf - a Python reader for libconfig files

This weekend, I uploaded my first package to PyPI: libconf, a pure-Python reader for files in libconfig format. This configuration file format is reminiscent of JSON and is mostly used in C/C++ projects through the libconfig library. It looks like this:

version = 7;
window: {
   title: "libconfig example"
   position: { x: 375; y: 210; w: 800; h: 600; }
};
capabilities: {
   can-do-lists: (true, 0x3A20, ("sublist"), {subgroup: "ok"})
   can-do-arrays: [3, "yes", True]
};

There are already two Python implementations: pylibconfig2 is a pure-Python reader licensed under the GPLv3, and python-libconfig provides bindings for the libconfig C++ library. The first one I didn't like because of its licensing, the second one because of the more involved installation procedure. Also, I kind of enjoy writing parsers.

So I sat down during the Easter weekend and wrote libconf. It's a pure-Python reader for libconfig files with an interface similar to the Python json module. There are two main functions: load(f) and loads(string). Both return a dict-like data structure that can be indexed (config['version']), but supports attribute access as well (config.version):

>>> import libconf
>>> with open('example.cfg') as f:
...     config = libconf.load(f)
>>> config['window']['title']
'libconfig example'
>>> config.window.title
'libconfig example'
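
loads() works the same way on strings, for example:

>>> config = libconf.loads('version = 7;')
>>> config['version']
7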

It was a fun little project. Creating a recursive descent parser is pretty straightforward, especially for such a simple file format. Writing documentation, packaging and uploading to GitHub and PyPI took longer than coding up the implementation itself.


dh_virtualenv and long package names (FileNotFound error)

Interesting tidbit about Linux:

A maximum line length of 127 characters is allowed for the first line in a #! executable shell script.

from man execve

Should be enough, right?

Wrong.

Well, not if you are using dh_virtualenv with long package names, anyway:

Installing pip...
  Error [Errno 2] No such file or directory while executing command /tmp/lo...me/bin/easy_install /usr/share/python-virtualenv/pip-1.1.tar.gz
...Installing pip...done.
Traceback (most recent call last):
  File "/usr/bin/virtualenv", line 3, in <module>
    virtualenv.main()
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 938, in main
    never_download=options.never_download)
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 1054, in create_environment
    install_pip(py_executable, search_dirs=search_dirs, never_download=never_download)
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 643, in install_pip
    filter_stdout=_filter_setup)
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 976, in call_subprocess
    cwd=cwd, env=env)
  File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/bin/dh_virtualenv", line 106, in <module>
    sys.exit(main() or 0)
  File "/usr/bin/dh_virtualenv", line 83, in main
    deploy.create_virtualenv()
  File "/usr/lib/python2.7/dist-packages/dh_virtualenv/deployment.py", line 112, in create_virtualenv
    subprocess.check_call(virtualenv)
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['virtualenv', '--system-site-packages', '--setuptools', 'debian/long-package-name/usr/share/python/long-package-name']' returned non-zero exit status 1
make: *** [binary-arch] Error 1
dpkg-buildpackage: error: fakeroot debian/rules binary gave error exit status 2

dh_virtualenv is used to create Debian packages that include Python virtualenvs. It is one of the better ways of packaging Python software, especially if there are Python dependencies that are not available in Debian or Ubuntu. When building a .deb package, it creates a virtualenv in a location such as:

/<build-directory>/debian/<packagename>/usr/share/python/<packagename>

This virtualenv has several tools under its bin/ directory, and they all have the absolute path of the virtualenv's Python interpreter hard-coded in their #! shebang line:

#!/<build-directory>/debian/<packagename>/usr/share/python/<packagename>/bin/python

Given that <build-directory> often contains the package name as well, it's easy to overflow the 127-character limit of the #! shebang line. In my case, with a ~30 character package name, the path length grew to 160 characters!

Consequently, the kernel couldn't find the Python executable anymore, and running any of the tools from the bin/ directory gave an ENOENT (file not found) error. This is what happened when virtualenv tried to install pip during the initial setup. The root cause of this error is not immediately obvious, to say the least.

To see whether this affects you, check the length of the shebang line of any script in the virtualenv's bin/ directory with wc:

head -n 1 /path/to/virtualenv/bin/easy_install | wc -c

If that's larger than 128 (wc -c counts the trailing newline), the shebang line exceeds the 127-character limit and is probably the cause of the problem.

The fix is to change the package name and/or the build location to something shorter. The alternative would be to patch the Linux kernel, which – depending on your preferences – sounds either fun or really unpleasant. Suit yourself!


Plotting maps with Folium

Data visualization in Python is a well-solved problem by now. Matplotlib and its prettier cousin Seaborn are widely used to generate static graphs. Bokeh generates HTML files with interactive, JavaScript-based graphs, which is a great way of sharing data with people who don't have a Python development environment ready. Several other libraries exist for more specialized purposes.

What has been missing for a long time were good map libraries. Plotting capabilities were fine, but basemap support in the existing libraries was very limited. For example, the popular Matplotlib-basemap has great plot types (contour maps, heatmaps, ...) but can't show any high-resolution maps: it only offers country/state shapes or whole-world images. Consequently, it's useless for drawing city- or street-level maps, unless you want to set up your own tile server (you don't).

Along comes Folium, a library that generates interactive maps in HTML format based on Leaflet.js. It supports, among others, OpenStreetMap and MapBox base layers which look great and provide enough details for large-scale maps.

Here is an example that shows some GPS data I cleaned up with a Kalman filter:

import folium

def plot(points, center):
    map_osm = folium.Map(location=center, zoom_start=16, max_zoom=23)
    map_osm.line(locations=points)
    map_osm.create_map(path='folium-example.html')

The result is an interactive HTML map. I find it pretty neat, especially given that it took only three lines of code to create.
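
The snippet above uses the Folium API that was current at the time of writing; with recent Folium versions, the equivalent looks roughly like this (a sketch, adjust to the version you have installed):

import folium

def plot(points, center):
    map_osm = folium.Map(location=center, zoom_start=16, max_zoom=23)
    folium.PolyLine(locations=points).add_to(map_osm)
    map_osm.save('folium-example.html')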



Returning generators from with statements

Recently, an interesting issue came up at work that involved a subtle interaction between context managers and generator functions. Here is some example code demonstrating the problem:

import contextlib

@contextlib.contextmanager
def resource():
    """Context manager for some resource"""

    print("Resource setup")
    yield
    print("Resource teardown")


def _load_values():
    """Load a list of values (requires resource to be held)"""

    for i in range(3):
        print("Generating value %d" % i)
        yield i


def load_values():
    """Load values while holding the required resource"""

    with resource():
        return _load_values()

This is the output when run:

>>> for val in load_values(): pass
Resource setup
Resource teardown
Generating value 0
Generating value 1
Generating value 2

Whoops. The resource is destroyed before the values are actually generated. This is obviously a problem if the generator depends on the existence of the resource.

When you think about it, it's pretty clear what's going on. Calling _load_values() produces a generator object, whose code is only executed when values are requested. load_values() returns that generator, exiting the with statement and leading to the destruction of the resource. When the outer for loop (for val) comes around to iterating over the generator, the resource is long gone.

How do you solve this problem? In Python 3.3 and newer, you can use the yield from syntax to turn load_values() into a generator as well. The execution of load_values() is halted at the yield from point until the child generator is exhausted, at which point it is safe to dispose of the resource:

def load_values():
    """Load values while holding the required resource"""

    with resource():
        yield from _load_values()

In older Python versions, an explicit for loop over the child generator is required:

def load_values():
    """Load values while holding the required resource"""

    with resource():
        for val in _load_values():
            yield val

Still another option is to turn the result of _load_values() into a list and return that instead, as shown below. This incurs a higher memory overhead, since all values have to be held in memory at the same time, so it's only appropriate for relatively short lists.
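
A minimal version of that variant:

def load_values():
    """Load values while holding the required resource"""

    with resource():
        # The list is fully built before the with block exits.
        return list(_load_values())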

To sum up, it's a bad idea to return generators from under with statements. While what's going on isn't terribly confusing, it's a wee bit subtle, and not many people think about it until they run into the issue. Hope this heads-up helps.


A better way for deleting Docker images and containers

In one of my last posts, I described the current (sad) state of managing Docker container and image expiration. Briefly, Docker creates new containers and images for many tasks, but there is no good way to automatically remove them. The best practice seems to be a rather hack-ish bash one-liner.

Since this wasn't particularly satisfying, I decided to do something about it. Here, I present docker-cleanup, a Python application for removing containers and images based on a configurable set of rules.

This is a rules file example:

# Keep currently running containers, delete others if they last finished
# more than a week ago.
KEEP CONTAINER IF Container.State.Running;
DELETE CONTAINER IF Container.State.FinishedAt.before('1 week ago');

# Delete dangling (unnamed and not used by containers) images.
DELETE IMAGE IF Image.Dangling;

Clear, expressive, straightforward. The rule language can do a whole lot more and provides a readable and intuitive way to define removal policies for images and containers.

Head over to GitHub, give it a try, and let me know what you think!


Using Python slice objects for fun and profit

Just a quick tip about the little-known slice objects in Python. They are used to implement the slicing syntax for sequence types (lists, strings):

s = "The quick brown fox jumps over the lazy dog"

# s[4:9] is internally converted (and equivalent) to s[slice(4, 9)].
assert s[4:9] == s[slice(4, 9)]

# 'Not present' is encoded as 'None'
assert s[20:] == s[slice(20, None)]

Slice objects can be used in normal code too, for example for tracking regions in strings: instead of keeping separate start_idx and end_idx variables (or writing a custom class/namedtuple), simply roll the indices into a slice.

# A column-aligned table:
table = ('REPOSITORY   TAG      IMAGE ID       CREATED       VIRTUAL SIZE',
         '<none>       <none>   0987654321AB   2 hours ago   385.8 MB',
         'chris/web    latest   0123456789AB   2 hours ago   385.8 MB',
        )
header, *entries = table

# Compute the column slices by parsing the header. Gives a list of slices.
slices = find_column_slices(header)

for entry in entries:
    repo, tag, id, created, size = [entry[sl].strip() for sl in slices]
    ...
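
find_column_slices() isn't shown above; one possible implementation (an illustrative sketch, assuming that column titles are separated by two or more spaces) could look like this:

import re

def find_column_slices(header):
    """Compute column slices from a column-aligned header line.

    Column starts are positions where a run of words begins; single spaces
    (as in "IMAGE ID") stay within one column, two or more spaces separate
    columns.
    """
    starts = [m.start() for m in re.finditer(r'\S+(?: \S+)*', header)]
    ends = starts[1:] + [None]
    return [slice(start, end) for start, end in zip(starts, ends)]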

This is mostly useful when the indices are computed at runtime and applied to more than one string.

More generally, slice objects encapsulate regions of strings/lists/tuples, and are an appropriate tool for simplifying code that operates on start/end indices. They provide a clean abstraction, make the code more straight-forward and save a bit of typing.


A neat Python debugger command

pdb is a console-mode debugger built into Python. Out of the box, it has basic features like variable inspection, breakpoints, and stack frame walking, but it lacks more advanced capabilities.

Fortunately, it can be customized with a .pdbrc file in the user's home directory. Ned Batchelder has several helpful commands in his .pdbrc file:

  • pl: print local variables
  • pi obj: print the instance variables of obj
  • ps: print the instance variables of self
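
For reference, aliases along these lines can be defined in .pdbrc roughly as follows (a sketch, not necessarily his exact definitions):

# pl: print the local variables of the current frame.
alias pl pp locals()

# pi obj: print the instance variables (the instance __dict__) of obj.
alias pi for k,v in sorted(%1.__dict__.items()): print("%s%-20s= %-80.80s" % ("%1.", k, repr(v)))

# ps: print the instance variables of self.
alias ps pi self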

Printing instance variables is great for quickly inspecting objects, but it shows only one half of the picture. What about the class-side of objects? Properties and methods are crucial for understanding what can actually be done with an object, in contrast to what data it encapsulates.

Since I couldn't find a readily available pdb command for listing class contents, I wrote my own:

# Print contents of an object's class (including bases).
alias pc for k,v in sorted({k:v for cls in reversed(%1.__class__.__mro__) for k,v in cls.__dict__.items() if cls is not object}.items()): print("%s%-20s= %-80.80s" % ("%1.",k,repr(v)))

pc lists the contents of an object's class and its base classes. Typically, these are the properties and methods supported by the object. It is used like this:

# 'proc' is a multiprocessing.Process() instance.
(Pdb) pc proc
...
proc.daemon              = <property object at 0x036B9A20>
proc.exitcode            = <property object at 0x036B99C0>
proc.ident               = <property object at 0x036B9A50>
proc.is_alive            = <function BaseProcess.is_alive at 0x033E4618>
proc.join                = <function BaseProcess.join at 0x033E45D0>
proc.name                = <property object at 0x036B99F0>
proc.pid                 = <property object at 0x036B9A50>
proc.run                 = <function BaseProcess.run at 0x033E4A98>
proc.start               = <function BaseProcess.start at 0x033E4DB0>
proc.terminate           = <function BaseProcess.terminate at 0x033E4DF8>

Note the difference from pi, which lists the contents of the proc instance:

(Pdb) pi proc       # In contrast, here is the instance dictionary.
proc._args          = ()
proc._config        = {'authkey': b'\xd0\xc8\xbd\xd6\xcf\x7fo\xab\x19_A6\xf8M\xd4\xef\x88\xa9;\x99c\x9
proc._identity      = (2,)
proc._kwargs        = {}
proc._name          = 'Process-2'
proc._parent_pid    = 1308
proc._popen         = None
proc._target        = None

In general, pc focuses on the interface while pi examines the state of the object. The two complement each other nicely. Especially when working with an unfamiliar codebase, pc is helpful for quickly figuring out how to use a specific class.

pc works with both Python 2 and Python 3 (on Python 2 it only shows new-style classes). Add it to your .pdbrc and give it a try. Let me know what you think!