I worked for a few years at the intersection of data science and software engineering. On the whole, it was a really enjoyable time and I’d like to have the chance to do so again at some point. One of the least enjoyable experiences from that time was dealing with big CSV exports. Unfortunately, this file format is still very common in the data science space. It is easy to understand why: it seems to be everywhere, it’s human-readable, it’s less verbose than options like JSON and XML, and it’s super easy to produce from almost any tool. What’s not to like?
Robin Kaveland
I’m not going to pretend to be some sort of expert on this matter, but even as a casual user it seems CSV isn’t exactly scalable to large data sets. It works great for smaller exports and imports for personal use, but it seems wholly unsuited for anything more complicated.
Very interesting read, will certainly take a closer look at Parquet for future projects.
The type conversion is a bit challenging, so I wrote a library to solve this problem. It can export any ResultSet to Parquet directly: https://manticore-projects.com/JDBCParquetWriter/index.html
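For anyone on the Python side, here is a rough sketch of the same idea, a query result written straight to Parquet instead of CSV. This is not JDBCParquetWriter’s API, just an illustration using pandas and SQLite as stand-ins; the file and table names are made up.

```python
# Sketch: write a query result straight to Parquet instead of CSV.
# Stand-in example (pandas + sqlite3), not the JDBCParquetWriter API.
import sqlite3

import pandas as pd

conn = sqlite3.connect("accounts.db")            # hypothetical source database
df = pd.read_sql_query("SELECT * FROM balances", conn)
df.to_parquet("balances.parquet", index=False)   # needs pyarrow or fastparquet installed
conn.close()
```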
I agree with the author’s point that CSV is under-specified. I also agree that CSV files aren’t technically optimal either. But the real value of CSV comes from its ubiquity. Tools for working with them are readily available for developers as well as normal end users. I’m actually happy when I can get my hands on CSV sources rather than some newfangled format.
As a software dev I’ve needed to deal with so many data exchange formats that it can get tiresome. Parquet feels like yet another implementation of the same darn thing. The authors always have good intentions of course, but it always creates more work to integrate and use it in practice.
Given the choice, I’d probably opt for JSON, just because it’s not only very flexible but also well supported in programming tools – often directly mapping to native language structures. I do wish end user/non-programming tools had better support for JSON though. An end user wouldn’t know what to do with one.
Seconded. I deal with a lot of CSVs. They’re easy, but have some flaws which at this point are well known and understood. I can hand one off to a non-technical person and they can load it into a spreadsheet to review, without having to warn them to install some custom app or explain how to interpret the format (like JSON/XML/YAML, etc.). Parquet does not seem to be well supported in the tools I use. There exists a third-party library of dubious quality for several languages I use, and I wouldn’t want to be dependent on upgrading that library and all of its dependencies when the author stops updating it.
I think as long as you don’t move big data around, it really does not matter, and CSV can do a lot of things correctly.
However, parsing CSV is expensive because there is no type safety, and when you have tables with billions of records (e.g. bank account balances or insurance payments), this really matters.
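A small illustration of the type-safety point (my own example in pandas, not the commenter’s): a CSV carries no schema, so the reader has to guess each column’s type, and the guess is easy to get wrong.

```python
# CSV carries no schema, so readers must re-infer column types on every load.
import pandas as pd

df = pd.DataFrame({"account_id": ["007", "042"], "balance": [10.5, 20.0]})
df.to_csv("balances.csv", index=False)
df.to_parquet("balances.parquet", index=False)  # needs pyarrow or fastparquet installed

print(pd.read_csv("balances.csv")["account_id"].tolist())          # [7, 42] -- leading zeros lost
print(pd.read_parquet("balances.parquet")["account_id"].tolist())  # ['007', '042'] -- types preserved
```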
Some RDBMS allow Parquet as first class data storage, which is not possible with CSV.
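DuckDB is one concrete example of such an RDBMS (my example, not the commenter’s): it can query and produce Parquet files directly in SQL, with no import or export step.

```python
# DuckDB treats Parquet files as something it can query and write directly.
import duckdb

# Query a Parquet file as if it were a table.
duckdb.sql("SELECT account_id, balance FROM 'balances.parquet' WHERE balance > 0").show()

# Write a query result back out as Parquet.
duckdb.sql("COPY (SELECT * FROM 'balances.parquet') TO 'copy.parquet' (FORMAT PARQUET)")
```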
I assure you, the banks still running their systems in COBOL on emulated AS/400s are most assuredly not using Parquet.
Banks are using space-separated CSVs. They usually don’t actually label them as CSVs but as text files (.TXT).
Yeah, I mean the records are defined with COBOL data types. COBOL was made for processing them extremely fast; processing anything else via COBOL is not likely to be faster.
A recent entrant in this space: https://github.com/SixArm/usv
It’s beside the point, since it does not address the main problem: no type safety for the columns.
Only XML (with validation) and Parquet/Avro allow that; XLSX, CSV, and friends are garbage in this regard (because they are literally the wrong tools for this purpose).
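To make the column-type-safety point concrete, here is a small pyarrow sketch (my own example, with made-up column names): the schema is declared up front and stored in the Parquet file itself, so every reader sees the same types.

```python
# Parquet stores an explicit schema, so column types are declared once
# and carried with the file, instead of being re-guessed by every reader.
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("account_id", pa.string()),
    ("balance", pa.float64()),
    ("updated", pa.timestamp("ms")),
])

table = pa.table(
    {
        "account_id": ["007", "042"],
        "balance": [10.5, 20.0],
        "updated": [datetime(2024, 1, 1), datetime(2024, 1, 2)],
    },
    schema=schema,
)
pq.write_table(table, "balances.parquet")

print(pq.read_schema("balances.parquet"))  # the declared types come back exactly as written
```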
I’m not sure replacing a very well known format with an obscure one is the way to go.
If you want interoperability, there is no alternative to CSV. It works on many platforms, including Excel, R, Jupyter, and others, and almost every language has a well-known parser for it.
If you want to share the schema as well, there are existing solutions, like XML, JSON, or, if you need performance, protobufs (which is a Google technology, but is used pretty much everywhere). And if you really need indexed access, there are SSTables.
(Here is an open source clone: https://github.com/thomasjungblut/go-sstables/blob/main/sstables/README.md)
Anyway, it is an interesting read; however, I was not particularly convinced.
LOL. Why is it that so many people, peddling solutions looking for a problem, always start their narrative with the assumption that everyone else is wrong?
CSV is actually one of the few things that works as a universal standard for dataset sharing. And it is so because it is application/representation agnostic. Excel, MATLAB, R, and Python basically make it a non-issue to work with these files.
Friends don’t let friends export to non-printing-character binary formats… there, fixed that for ya lol.
I was thinking about this when the article was new: if data types, compatibility, and speed of reading are really important, I might suggest SQLite DBs. Very portable and almost universally supported. But people hate using SQL for reasons not well understood.
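A minimal sketch of that workflow using Python’s built-in sqlite3 module (my own example, with made-up file and table names):

```python
# SQLite as an interchange file: one portable .db file with declared
# column types, readable from almost any language's standard tooling.
import sqlite3

# Producer side: write the export.
con = sqlite3.connect("export.db")
con.execute("CREATE TABLE IF NOT EXISTS balances (account_id TEXT, balance REAL, updated TEXT)")
con.executemany(
    "INSERT INTO balances VALUES (?, ?, ?)",
    [("007", 10.5, "2024-01-01"), ("042", 20.0, "2024-01-02")],
)
con.commit()
con.close()

# Consumer side: open the same file and query it.
con = sqlite3.connect("export.db")
for row in con.execute("SELECT account_id, balance FROM balances WHERE balance > 15"):
    print(row)
con.close()
```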
Bill Shooter of Bul,
I love SQL, and SQLite is pretty handy too. But as a data interchange format, it seems weird. Off the top of my head, I’m not sure how easy it is to execute SQLite queries from another database without first exporting and importing the data again, which makes it cumbersome. You could technically transfer a SQL backup for the schema and blindly import that, but I would think that automatically loading a full SQL database from a potentially untrusted source could pose security risks.