I first started using Excel in 1997 or 1998. I was in college majoring in accounting, so it was inevitable that I would need to learn the program. But I fell in love with Excel during my first internship. I don’t remember much about the job, but I do remember that it involved a lot of Excel (and the “Geac” accounting system if I remember correctly). And there was one part of the job that involved opening a particular Excel spreadsheet, updating a few cells, and pushing a button (on the spreadsheet) to create the journal entries that needed to be posted for the day. The first time I pressed that button, I was mesmerised. The screen started flashing, text started rolling down the spreadsheet, and then few moments later it was done. Looking at the spreadsheet, I felt like the kid at the party who is not allowed to eat sugar but who just got a piece of chocolate cake anyway. I wasn’t sure why I was excited, but I knew something good had just happened. The spreadsheet was filled with information — lots of information — and I couldn’t explain how it got there. I asked my manager how it worked, and he said he didn’t know (some intern from Wharton programmed the spreadsheet the year before).
Since I had never worked in corporate America before, I didn’t know that I was supposed to just keep pressing that button every day without asking how it worked. Instead, in my ignorance, I went out and bought “Learn VBA in 24 Hours” (which I still own today!) so that I could figure it out. I devoured the book and officially became a “programmer.”
I used my newfound knowledge to automate other parts of the job. Every time I had to manipulate some data more than once, I tried to talk to my manager about automating it. And like every good manager, he tried to figure out what words he could use to get me to stop bugging him. Despite his efforts, by the time I finished that internship I had turned a 40-hour-a-week job into an 18-hour-a-week job and I’m sure I used my downtime to learn another skill that was about to become mainstream: surfing the net.
I’ve continued to use Excel over the last 25 years and continue to be amazed by its power. It has its warts, sure. But it is an awe-inspiring piece of software. And it continues to act as the de facto format for exchanging (semi-?)structured information. But because Excel is so ubiquitous and its power so great, people abuse the program. Not out of malice or ignorance, but out of pragmatism. They know you can do just about anything with Excel, and they know you’ll be able to open the file and read the data when they send it to you. And that’s where this story really starts.
Abusing Excel, a Worldwide Pastime
As I noted above, I’ve used Excel for about 25 years. I know the program inside and out, but I also recognize it has its downsides 1. I don’t think that everyone shares my feelings, though. In my last job, I ran a company that did “government reporting” for clients. When I started to see some of the files that our clients would send us on a regular basis, I was often … bemused? I don’t know what the right word is, I just know that we would get Excel files from clients that we needed to process that were 100s of MBs and our clients were not willing (or able?) to send us data in another format, so we had to deal with it.
Unfortunately, I didn’t have a bunch of data engineers at my disposal. Most associates were accounting/finance graduates that would happily process files “manually” every time they received them. So we hired some Python programmers and I worked with them to try to tackle the problem using regular programming techniques instead. Unfortunately, most of the Python libraries we tried didn’t handle “big” Excel files very well. They could certainly handle the files, but it felt like it took too long. Especially when we had a new client or were running into problems. In some circumstances, it could take 30 seconds to process some of the bigger files. That’s not a tremendous amount of time, but it did feel “too long.” So I decided to try to find a way to do it faster.
Python’s Standard Library is Amazing
When I started to investigate how these libraries processed Excel files, I found that they all seemed to need to load the entire file before they could do anything. None could load, for instance, the first ten rows. But I found that you could leverage the Python standard library to do just that.
The first thing I needed to understand was that every xlsx file is just a zip file. No more, no less. That means that you can use a program like 7-zip to look at any xlsx file in your possession. If you start to play around with some files, you’ll see that they are not that complicated. There are a few key concepts:
- Every(?) file in the zip is an xml file.
xl/_rels/workbook.xml.relsfile gives you a “roadmap” for some of the key files in the zip.
- Strings aren’t stored directly in the worksheet xml files. Instead, they are
xl/sharedStrings.xml, so you need to deal with strings a bit differently than all other data types.
- Worksheet files (normally stored in
xl/worksheets/*) have rows (
row), cells (
c), and values (
- Cells (
c) have data types, but sometimes you need to leverage the information in the styles file (
xl/styles.xml) to get better data translations (e.g., mostly for dates).
With this knowledge in-hand, we can use the Python standard library to do everything we need. zipfiles will let us extract any of the individual files, and ElementTree offers a pull parser so that we can process elements as we see them. No need to load everything at the beginning if you’re willing to do some xml processing and data conversions yourself.
All this research resulted in the sxl library. It exploits the fact that xlsx files are really zip files, that all the files in the zip are xml files, and that each of the files in the zip are (probably?) organized in a very logical way 2. I’ve found that by leveraging each of these points, we could speed up how quickly we could process an Excel file sometimes by a factor of 10x.
Holy Sh** I Like Rust
sxl library is fine and it handles extremely large files with ease. But
you need to know Python to get it working. I wanted to develop a tool that you
could use without having to know Python, and I was debating whether I wanted to
use Rust or Nim. I had
played around with Nim before and I like it a lot. I hadn’t done much in Rust,
but it does seem to be what all the cool kids are using, so I thought I’d give
it a try. I’m glad I did.
“Translating” my code from Python to Rust wasn’t too difficult. I definitely had to create some types that felt a little “weird” 3, but other than that it was mostly very pleasant. And some things were more than pleasant, they were a real joy. Like the feeling I got when I’d been trying to defeat Bald Bull over and over, then finally figured out how to get the stars that let me knock him out. Here are some of the things that just I really enjoyed:
- Pattern matching. I’ve used pattern matching in other languages, but this is the first time that it really clicked for me. It made writing the xml parsing portion of the program much cleaner than the Python version IMO.
- Testing. Writing tests with a few extra annotations is great. And having documentation code automatically checked whenever you run tests is brilliant. Writing tests felt much easier in Rust than in any other language I’ve used.
- Documentation. I got started with The Rust
Book and I was able to get started
programming within a few hours. The rest of the documentation filled in
anything I couldn’t get from the book. Oh, and it’s all available offline
rustup docs. This goes for my documentation or the crates that I’m using, too! I can easily write comments in my program and view the “online” documentation for my library and all my dependencies with
cargo doc --open.
- Binaries. Like Go, you get a static binary that can be downloaded and ran without much hassle. The binaries are a very respectable size, too (the xlcat download is less than 500kb on windows and mac and less than 1mb on linux).
There were definitely more things that I liked about Rust, but I think these were the big ones. Ultimately, I was able to build the command-line version I wanted in a very reasonable amount of time. And I can download it on any platform I want without much fanfare.
How to Download and Use xlcat
If you want to try it out, grab a binary from the xlcat download page. Once it’s on your machine:
# List tabs in file ./xlcat <path-to-xlsx> # Print first 10 rows of file (comma-delimited) ./xlcat <path-to-xlsx> <tab> -n 10
It doesn’t have a ton of functionality yet, but I’ll enhance it as I have needs or if I hear from others that there is something they’d like to see. If you want to hear about updates I make to the tool (or about other things I work on), please join the mailing list in the form below.
Thanks for reading!
In fact, I tried to create a “competitor” of sorts to Excel about ten years ago, but that’s a story for another time. ↩︎
Specifically, we assume that writers of a worksheet xml file will write row 1 before they write row 2 (and so on), and that column “A” data will be written before column “B” data, and so on. ↩︎
For example, I’ve found that it is sometimes useful to get worksheets by name as well as by position/number. Rust doesn’t let you use algebraic data types like Typescript or OCaml. Instead, you create an
Traitto accomplish the same thing. I found this to be a little awkward, but I’ve only dabbled with the language for a bit over a week, so maybe it will grow on me or I’ll learn a better way to do it! ↩︎