QuarkJets
Sep 8, 2008

Seventh Arrow posted:

Also, when I originally had thoughts of creating a data engineering thread, the subtitle was going to be "do you want that in CSV, CSV, or CSV?"

My preferred format is hdf5, to be honest. I've converted a few groups over from csv because it rules and is way faster/better. Changing isn't always an option of course, but I love encountering projects that use hdf5

Why use hdf5 instead of csv?
1. It's natively supported by both pandas and the excellent h5py and pytables libraries, among others. With h5py datasets can be read directly from a file as numpy arrays. The hdf5 file structure representation in python is basically a dictionary
2. It's a binary format so numerical data is stored more compactly and efficiently
3. Data can be given a structure that's more useful than simple rows/columns. Individual datasets can be any size, you can create what are basically folders with more datasets inside of them, you can assign attributes wherever you need them (e.g. to designate units on a dataset, to leave comments on a group/folder, to complain about the weather in the file header, whatever you want!)
4. Datasets have a defined type, so pandas doesn't need to be told what datatype to expect (or doesn't have to infer a type from the data in each column)
5. Compression is natively supported at the dataset level and is completely transparent. The default built-in algorithm is gzip but you can use whatever compression scheme you want. That column full of one million 0s ballooning the size of your csv compresses neatly to almost no space at all in a compressed hdf5 dataset
6. Data can be chunked for more efficient reading and writing
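A minimal sketch of those points with h5py (the filename, group layout, and attribute names here are invented for illustration, not from any real project):

```python
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    # Groups act like folders; intermediate groups are created automatically
    grp = f.create_group("sensors/temperature")

    # A typed, compressed, chunked dataset: a million 0s takes almost
    # no space on disk with gzip compression enabled
    data = np.zeros(1_000_000, dtype=np.float64)
    dset = grp.create_dataset("readings", data=data,
                              compression="gzip", chunks=True)

    # Attributes can hang off datasets, groups, or the file itself
    dset.attrs["units"] = "kelvin"
    f.attrs["comment"] = "the weather here is terrible"

with h5py.File("example.h5", "r") as f:
    # Dictionary-style access; slicing reads back a numpy array
    dset = f["sensors/temperature/readings"]
    readings = dset[:]
    units = dset.attrs["units"]
```

Because the dataset carries its dtype, `readings` comes back as float64 without anything having to infer types from the data.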

QuarkJets
Sep 8, 2008

Hughmoris posted:

I heard that working with Oracle was an invitation to Planet Money. Is that not the case?

Maybe it was just for Oracle DBA training.

Every single person I know who has spent appreciable time working with oracle databases seems to seriously hate them, but I don't have any personal experience with Oracle DB so I don't really know why :shrug:
