XLSX are Just Text Files in a Trench Coat


It’s XML All the Way Down

A few years ago I was tasked to use an API with poor documentation that was only available in SOAP. Fortunately, there is a Python library called Zeep (docs.python-zeep.org/) – without it, I would have been completely lost. I grumbled through this task, feeling like I was working on something that if not already obsolete was well on its way.

Only recently, I learned that XLSX files are actually zipped XML files. So let’s take one apart:

1. Initial CSV File

Starting with a CSV file has about 1200 rows and 17 columns comes to 210 KB.

user/downloads/ └── data.csv (210 KB)

2. CSV Saved as .xlsx

Since the file is zipped, saving it as a .xlsx file reduces the file size to 141 KB.

user/downloads/ ├── data.csv (210 KB) └── data.xlsx (141 KB)

3. Formatted .xlsx File

Creating a formatted table and changing the fonts increases the size of the file to 768 KB.

user/downloads/ ├── data.csv (210 KB) └── data.xlsx (768 KB)

4. Now you can change the file extension to .zip and unzip the file to see the contents

Top of the xml folder

What we thought was a single file is actually a nested folder structure.

├── xl/ │ ├── _rels/ │ │ └── workbook.xml.rels │ ├── theme/ │ │ └── theme1.xml │ ├── worksheets/ │ │ └── sheet1.xml │ ├── styles.xml │ ├── workbook.xml │ └── sharedStrings.xml

nested xml

Now the weird changes in file size and why some (relatively) small XLSX files take so long to open almost makes sense. When I heard it was the Microsoft standard since 2007, I instantly assumed it was some nefarious plot to consolidate their market share. But actually, it opened up Excel from a proprietary to an open format called Office Open XML (OOXML). My bad Microsoft. I guess I should have read the ISO/IEC 29500:2008 standard before jumping to conclusions.

XML Forever?

XML (eXtensible Markup Language) was created in the late 1990s to replace SGML which I am guessing was a replacement for something else. JSON (JavaScript Object Notation) is a lighter weight alternative but obviously ther are still many applications where XML is the better choice. Or in the case of XLSX, the only choice.