QuarkJets
Sep 8, 2008

Seventh Arrow posted:

Also, when I originally had thoughts of creating a data engineering thread, the subtitle was going to be "do you want that in CSV, CSV, or CSV?"

My preferred format is hdf5, to be honest. I've converted a few groups over from csv because it rules and is way faster/better. Changing isn't always an option of course, but I love encountering projects that use hdf5

Why use hdf5 instead of csv?
1. It's natively supported by both pandas and the excellent h5py and pytables libraries, among others. With h5py datasets can be read directly from a file as numpy arrays. The hdf5 file structure representation in python is basically a dictionary
2. It's a binary format so numerical data is stored more compactly and efficiently
3. Data can be given a structure that's more useful than simple rows/columns. Individual datasets can be any size, you can create what are basically folders with more datasets inside of them, you can assign attributes wherever you need them (e.g. to designate units on a dataset, to leave comments on a group/folder, to complain about the weather in the file header, whatever you want!)
4. Datasets have a defined type, so pandas doesn't need to be told what datatype to expect (or doesn't have to infer a type from the data in each column)
5. Compression is natively supported at the dataset level and is completely transparent. The default built-in algorithm is gzip but you can use whatever compression scheme you want. That column full of one million 0s ballooning the size of your csv compresses neatly to almost no space at all in a compressed hdf5 dataset
6. Data can be chunked for more efficient reading and writing
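A minimal sketch of those points with h5py (the filename, group layout, and attribute names here are invented for illustration, not from any real project):

```python
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    # Groups act like folders; intermediate groups are created automatically
    grp = f.create_group("sensors/temperature")

    # A typed, compressed, chunked dataset: a million 0s takes almost
    # no space on disk with gzip compression enabled
    data = np.zeros(1_000_000, dtype=np.float64)
    dset = grp.create_dataset("readings", data=data,
                              compression="gzip", chunks=True)

    # Attributes can hang off datasets, groups, or the file itself
    dset.attrs["units"] = "kelvin"
    f.attrs["comment"] = "the weather here is terrible"

with h5py.File("example.h5", "r") as f:
    # Dictionary-style access; slicing reads back a numpy array
    dset = f["sensors/temperature/readings"]
    readings = dset[:]
    units = dset.attrs["units"]
```

Because the dataset carries its dtype, `readings` comes back as float64 without anything having to infer types from the data.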

QuarkJets
Sep 8, 2008

Hughmoris posted:

I heard that working with Oracle was an invitation to Planet Money. Is that not the case?

Maybe it was just for Oracle DBA training.

Every single person I know who has spent appreciable time working with oracle databases seems to seriously hate them, but I don't have any personal experience with Oracle DB so I don't really know why :shrug:
