
Alert: 15-year-old Python tarfile flaw lurks in ‘over 350,000’ code projects

At least 350,000 open source projects are believed to be potentially vulnerable to exploitation via a Python module flaw that has remained unfixed for 15 years.

On Tuesday, security firm Trellix said its threat researchers had encountered a vulnerability in Python’s tarfile module, which provides a way to read and write compressed bundles of files known as tar archives. Initially, the bug hunters thought they’d chanced upon a zero-day.

It turned out to be about a 5,500-day issue: the bug has been living its best life for the past decade-and-a-half while awaiting extinction.

Identified as CVE-2007-4559, the vulnerability surfaced on August 24, 2007, in a Python mailing list post from Jan Matejek, who was at the time the Python package maintainer for SUSE. It can be exploited to potentially overwrite and hijack files on a victim’s machine, when a vulnerable application opens a malicious tar archive via tarfile.

“The vulnerability goes basically like this: If you tar a file named "../../../../../etc/passwd" and then make the admin untar it, /etc/passwd gets overwritten,” explained Matejek at the time.
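To see why a name like that is trouble, it helps to work out where a naive extractor would end up writing. The sketch below is purely illustrative: the helper names resolve_member and escapes and the /tmp/unpack directory are our own assumptions, not anything from Matejek's post or from tarfile itself. It uses posixpath because tar member names use forward slashes.

```python
import posixpath

def resolve_member(dest_dir, member_name):
    """Return the normalized path a naive extractor would write to (hypothetical helper)."""
    return posixpath.normpath(posixpath.join(dest_dir, member_name))

def escapes(dest_dir, member_name):
    """True if the resolved path lands outside dest_dir (hypothetical helper)."""
    dest = posixpath.normpath(dest_dir)
    target = resolve_member(dest, member_name)
    return posixpath.commonpath([dest, target]) != dest

# An ordinary member stays inside the extraction directory.
print(resolve_member("/tmp/unpack", "docs/readme.txt"))            # -> /tmp/unpack/docs/readme.txt
# Enough "../" components climb to the filesystem root and back down.
print(resolve_member("/tmp/unpack", "../../../../../etc/passwd"))  # -> /etc/passwd
```

Each ".." cancels one directory level, and any surplus is discarded at the root, so the number of "../" prefixes only has to be large enough.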

The tarfile directory traversal flaw was reported on August 29, 2007 by Tomas Hoger, a software engineer at Red Hat.

But it had already been addressed, sort of. One day earlier, Lars Gustäbel, maintainer of the tarfile module, committed a code change to the TarFile.extractall() method that adds a check_paths parameter, true by default, and a helper function that throws an error if a file path in the tar archive is insecure.
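A check of that general kind might look something like the sketch below. To be clear, this is not Gustäbel's code: the names safe_extractall and InsecurePathError are hypothetical, and the policy of refusing anything that is not a regular file or directory is a deliberately conservative assumption on our part.

```python
import os
import tarfile

class InsecurePathError(Exception):
    """Hypothetical error type for members that must not be extracted."""

def safe_extractall(archive_path, dest_dir):
    """Extract archive_path into dest_dir, refusing any member that would escape it."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        for member in members:
            # Assumption: links and device files are rejected outright,
            # since a link can redirect a later write outside dest_root.
            if not (member.isfile() or member.isdir()):
                raise InsecurePathError(f"{member.name!r} is not a regular file or directory")
            # Resolve where the member would land and compare with the root.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise InsecurePathError(f"{member.name!r} resolves outside {dest_root!r}")
        # Only reached if every member passed the checks above.
        tar.extractall(dest_root, members=members)
```

Every member is validated before anything is written, so a single bad entry means nothing is extracted at all.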

But the fix did not address the TarFile.extract() method – which Gustäbel said “should not be used at all” – and left open the possibility that extracting data from untrusted archives might cause problems.

In a comment thread, Gustäbel explained he no longer considers this a security issue. “tarfile.py does nothing wrong, its behavior conforms to the pax definition and pathname resolution guidelines in POSIX,” he wrote.

“There is no known or possible practical exploit. I [updated] the documentation with a warning that it might be dangerous to extract archives from untrusted sources. That is the only thing to be done IMO.”

Indeed, the documentation describes this footgun:

Warning: Never extract archives from untrusted sources without prior inspection. It is possible that files are created outside of path, e.g. members that have absolute filenames starting with "/" or filenames with two dots "..".
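What "prior inspection" means in practice is left to the reader. One minimal interpretation, offered here as a sketch rather than anything the documentation prescribes, is to list member names without extracting and flag the two patterns the warning mentions. The function name suspicious_members is our own invention.

```python
import tarfile

def suspicious_members(archive_path):
    """Return names of members that are absolute or contain a '..' component."""
    flagged = []
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            name = member.name
            # Tar member names use "/" as the separator regardless of platform.
            if name.startswith("/") or ".." in name.split("/"):
                flagged.append(name)
    return flagged
```

This is a purely lexical check on names. It does not look at symbolic or hard links inside the archive, which would need separate handling.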

And yet here we are, with both extract() and extractall() still posing the threat of arbitrary path traversal.

“The vulnerability is a path traversal attack in the extract and extractall functions in the tarfile module that allow an attacker to overwrite arbitrary files by adding the ‘..’ sequence to filenames in a tar archive,” explained Kasimir Schulz, a vulnerability researcher for Trellix, in a blog post.

In a file path, the “..” sequence refers to the parent directory. So with code like the six-line snippet below, Schulz says, the tarfile module can be told to read and modify a file’s metadata before the file is added to the tar archive. The result is an exploit.

    import tarfile

    def change_name(tarinfo):
        tarinfo.name = "../" + tarinfo.name  # prepend a parent-directory hop to the stored name
        return tarinfo

    with tarfile.open("exploit.tar", "w:xz") as tar:
        tar.add("malicious_file", filter=change_name)

According to Schulz, Trellix built a free tool called Creosote to scan for CVE-2007-4559. The software has already found the bug lurking in applications like Spyder IDE, an open-source scientific environment written for Python, and Polemarch, an IT infrastructure management service for Linux and Docker.
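How Creosote decides what counts as vulnerable is not described here, so the following is not its algorithm. It is only a rough sketch of how a source scanner could start: parse a Python file and report every method call named extract or extractall. The function name find_extract_calls is hypothetical, and the approach is naive enough to flag unrelated objects, such as zipfile archives, that happen to share those method names.

```python
import ast

RISKY_METHODS = {"extract", "extractall"}

def find_extract_calls(source):
    """Return sorted (line, method) pairs for calls like obj.extractall(...) in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in RISKY_METHODS):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)

sample = (
    "import tarfile\n"
    "with tarfile.open('upload.tar') as t:\n"
    "    t.extractall('out')\n"
)
print(find_extract_calls(sample))  # -> [(3, 'extractall')]
```

A real tool would also need to work out whether the object is actually a TarFile and whether the member names were validated first, which is where the hard work lies.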

The company estimates the tarfile flaw can be found “in over 350,000 open-source projects and prevalent in closed-source projects.” It also points out that tarfile is a default module in any Python project and is present in frameworks created by AWS, Facebook, Google, and Intel, and in applications for machine learning, automation, and Docker containers.

Trellix says it’s working to make repaired code available to affected projects.

“Using our tools, we currently have patches for 11,005 repositories, ready for pull requests,” explained Charles McFarland, a vulnerability researcher for Trellix, in a blog post. “Each patch will be added to a forked repository and a pull request made over time. This will help individuals and organizations alike become aware of the problem and give them a one click fix.

“Due to the size of vulnerable projects we expect to continue this process over the next few weeks. This is expected to hit 12.06 percent of all vulnerable projects, a little over 70K projects by the time of completion.”

The remaining 87.94 percent of affected projects may wish to consider other possible options. ®