  • Locked thread
The Fool
Oct 16, 2003


I may not be understanding the problem, but docx files are just archives.

You should be able to extract one with any archive program, then read the xml file that contains the text content.

Phone posting, so I can't test.
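A minimal sketch of the idea above, using only the standard library. It assumes the body text lives in word/document.xml (the usual location in a .docx) and the helper name docx_paragraphs is made up for illustration:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path) as archive:
        # The main body text lives in word/document.xml inside the archive.
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

This ignores headers, footnotes and tables-as-structure; it only pulls the paragraph text out, which is usually enough for a quick look.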

Hughmoris
Apr 21, 2007
Let's go to the abyss!
A bit of a vague question here, mainly cause I've never really understood OOP.

When you're parsing CSVs, do you ever create objects for the data you're working with, or just use arrays/dicts?

Nippashish
Nov 2, 2005

Let me see you dance!

Hughmoris posted:

A bit of a vague question here, mainly cause I've never really understood OOP.

When you're parsing CSVs, do you ever create objects for the data you're working with, or just use arrays/dicts?

If I'm working with CSVs I use pandas unless I have a really, really compelling reason not to.

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Nippashish posted:

If I'm working with CSVs I use pandas unless I have a really, really compelling reason not to.

My problem is that I don't have a great grasp of Pandas, and a lot of what I'm doing is parsing data from 2 CSVs and trying to join/merge certain parts.

Eela6
May 25, 2007
Shredded Hen
CSVs are a great use case for namedtuples, available in collections. In fact, the csv module has an option for exactly that (I made the mistake of rolling my own before fully reading the csv docs).

To be honest, for most cases there's nothing wrong with using a dictionary. I like to start with a dict or namedtuple and refactor to a class if things start getting complicated (think nested dictionaries or functions which seem more like methods).
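A rough sketch of the namedtuple approach, assuming the first row of the file is a header. The function name read_rows and the sample data are made up for illustration:

```python
import csv
import io
from collections import namedtuple


def read_rows(csv_file):
    """Yield one namedtuple per data row, using the header row for field names."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # rename=True replaces header names that are not valid identifiers
    # (e.g. "first name") with positional names like _0, _1.
    Row = namedtuple("Row", header, rename=True)
    for values in reader:
        yield Row(*values)


# Hypothetical sample data, standing in for a real file opened with newline="".
sample = io.StringIO("name,dept,salary\nalice,eng,100\nbob,ops,90\n")
rows = list(read_rows(sample))
print(rows[0].name, rows[0].salary)  # values are still strings at this point
```

If a dict per row is enough, csv.DictReader gives you that without any extra code.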

Nippashish
Nov 2, 2005

Let me see you dance!
While there's nothing wrong with not using pandas for working with tabular data, if it's something you plan to do more than, say, 5 times in your life, then it's worth investing the effort to learn how to do your thing in pandas. Pandas is very powerful, and while it makes some trivial things more complicated (because you need to learn how to use it) it makes so many other things into trivial operations that it's worth the initial investment to learn it.

Eela6
May 25, 2007
Shredded Hen
A csv-data-handling problem was mentioned earlier in the thread here. There's a number of solutions following, including a slick pandas one and a namedtuples solution I wrote. Hope these give you some ideas.

Eela6 fucked around with this message at 04:28 on Jan 22, 2017

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Eela6 posted:

A csv-data-handling problem was mentioned earlier in the thread here. There's a number of solutions following, including a slick pandas one and a namedtuples solution I wrote. Hope these give you some ideas.

Haha yep, I used your code to solve that problem and it worked beautifully. Been using those concepts to solve other projects I've tackled as well. Thanks!

KernelSlanders
May 27, 2013

Rogue operating systems on occasion spread lies and rumors about me.
If you're trying to join / merge tables, you definitely should be using Pandas.
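For anyone who hasn't seen it, a minimal sketch of a join in pandas. The column names and data are hypothetical; with real files you would pass filenames to pd.read_csv instead of the StringIO objects:

```python
import io
import pandas as pd

# Hypothetical data standing in for two CSV files on disk.
employees_csv = io.StringIO("emp_id,name\n1,alice\n2,bob\n3,carol\n")
salaries_csv = io.StringIO("emp_id,salary\n1,100\n2,90\n4,70\n")

employees = pd.read_csv(employees_csv)
salaries = pd.read_csv(salaries_csv)

# Inner join: keep only emp_id values present in both tables.
both = employees.merge(salaries, on="emp_id", how="inner")

# Left join: keep every employee, with NaN where no salary row matched.
everyone = employees.merge(salaries, on="emp_id", how="left")

print(both)
print(everyone)
```

The how argument is the main knob: "inner", "left", "right" and "outer" behave like their SQL namesakes.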

Eela6
May 25, 2007
Shredded Hen

Hughmoris posted:

Haha yep, I used your code to solve that problem and it worked beautifully. Been using those concepts to solve other projects I've tackled as well. Thanks!

I'm glad it helped! It makes me feel really good. :)

vodkat
Jun 30, 2012



cannot legally be sold as vodka

KernelSlanders posted:

If you're trying to join / merge tables, you definitely should be using Pandas.

Just wanting to jump on the bandwagon here and echo that pandas is really great and powerful and I can't really think of any compelling reason why you would want to work with csv files in python and not use it in some form.

Cingulate
Oct 23, 2012

by Fluffdaddy
On my Ubuntu server, whenever I try to install an R package, I get compile errors:

quote:

gcc: error: unrecognized command line option ‘-fstack-protector-strong’

It seems that this is because my default glib/libc comes from Anaconda, and Anaconda's versions are very old. What's the best way of dealing with this? I use R very rarely, but Python a lot.

Beef
Jul 26, 2004
Try manually installing gcc to get a more recent one:
code:
sudo apt-get install gcc

QuarkJets
Sep 8, 2008

vodkat posted:

Just wanting to jump on the bandwagon here and echo that pandas is really great and powerful and I can't really think of any compelling reason why you would want to work with csv files in python and not use it in some form.

If you're writing something automated and need the extra performance, then it can be worth it to skip using Pandas. That's the only one I can think of though, Pandas is great for interactive analysis

The real answer though is you shouldn't store non-trivial amounts of data in CSV

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Where does one even get CSV-formatted data nowadays? Excel exports?

lifg
Dec 4, 2000
<this tag left blank>
Muldoon
The world runs on Excel. If I can even convince people to export Excel to CSV before uploading it I consider it a victory.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Thermopyle posted:

Where does one even get CSV-formatted data nowadays? Excel exports?

"See attached report"

Outside of that, in my experience, it's still a common data exchange format because it's so widely supported and 'easy', meaning the pain comes in dribs and drabs forever instead of being in the up-front setup effort.

Cingulate
Oct 23, 2012

by Fluffdaddy

Thermopyle posted:

Where does one even get CSV-formatted data nowadays? Excel exports?
I write all my behavioral experimental result files in what amounts to CSV format, what'd be wrong with that?

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Cingulate posted:

I write all my behavioral experimental result files in what amounts to CSV format, what'd be wrong with that?

I don't know that there's anything wrong with it other than the pain that comes from dealing with CSV...but every data interchange format has some sort of pain. I was just asking because I'm curious...it's been a decade since I've had to consume data in CSV.

Beef
Jul 26, 2004
I'm seeing plenty of CSV. I am looking at a ML pipeline at the moment with about 20GB of CSV, compressed of course.

Intel DAAL supports only CSV input files out of the box, TensorFlow has a protobuf and csv reader. It's also the preferred way to read in tables/frames in R.


That said, there is plenty wrong with CSV. Stop, just please, stop. There is no standard or even RFC, everyone does whatever the gently caress they want, most CSV are not comma separated and locales can gently caress up data if you do not double check the result manually.

Code horror anecdote: My alma mater had a student exam score database problem because it was exported/imported in CSV. Guess what happens in a locale where ',' is the decimal separator, such as much of Europe.

edit: whoops, looks like my info is outdated: https://www.ietf.org/rfc/rfc4180.txt
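To make the locale problem concrete, here is a small sketch of reading a file where the comma is part of the number rather than the field delimiter. It assumes a semicolon-delimited export with decimal commas; the student names and scores are invented:

```python
import csv
import io

# Hypothetical export from a locale that writes decimals with a comma,
# so the field delimiter has to be a semicolon instead.
export = io.StringIO("student;score\nanna;17,5\nbram;9,25\n")

reader = csv.DictReader(export, delimiter=";")
scores = {}
for row in reader:
    # Convert the decimal comma explicitly instead of relying on the locale.
    scores[row["student"]] = float(row["score"].replace(",", "."))

print(scores)
```

Read the same file with the default comma delimiter and every score gets split into two fields, which is roughly how data gets mangled on import.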

KernelSlanders
May 27, 2013

Rogue operating systems on occasion spread lies and rumors about me.

Thermopyle posted:

Where does one even get CSV-formatted data nowadays? Excel exports?

Database dumps and the output of various processes destined to be loaded into redis or other database tables because our VP of engineering wants the data to be "human readable".

vodkat
Jun 30, 2012



cannot legally be sold as vodka

Thermopyle posted:

I don't know that there's anything wrong with it other than the pain that comes from dealing with CSV...but every data interchange format has some sort of pain. I was just asking because I'm curious...it's been a decade since I've had to consume data in CSV.

Most things academic, statistical or data sciency will prefer, or at the very least give you the option of, a csv.

KernelSlanders
May 27, 2013

Rogue operating systems on occasion spread lies and rumors about me.
Also half the size of the equivalent json.

There are definitely better arbitrary serialization formats (avro) but you can do a lot worse. I'm curious what format Thermopyle is consuming data in.

QuarkJets
Sep 8, 2008

HDF basically obsoletes the CSV format. If you're using tons and tons of CSVs, consider switching to HDF5. The benefits are innumerable; your files will be way smaller, way faster to read and write, and way more organized (an HDF5 file is like a little file system for your data that you can organize however you want, and compression is transparent and effortless). The downside is that you can't just poo poo out a ton of numbers into a text file

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

KernelSlanders posted:

I'm curious what format Thermopyle is consuming data in.

Almost always pulled from a database.

Nippashish
Nov 2, 2005

Let me see you dance!

QuarkJets posted:

The downside is that you can't just poo poo out a ton of numbers into a text file

And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.

Eela6
May 25, 2007
Shredded Hen
My approach is this:

If my data is a reasonably-sized (named)tuple, CSVs are fine. Once the data becomes complicated enough to need another data structure, CSVs are no longer sufficient.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...
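A quick sketch of what that could look like with the standard library's sqlite3 module. The table name, columns and data are hypothetical, and this says nothing about performance on large data:

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; a real script would open a file instead.
source = io.StringIO("name,dept,salary\nalice,eng,100\nbob,ops,90\ncarol,eng,120\n")
rows = list(csv.DictReader(source))

# ":memory:" keeps the example self-contained; pass a filename such as
# "data.sqlite" to get a single file that can be handed to someone else.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (:name, :dept, :salary)",
    rows,
)
conn.commit()

# Only the rows the query asks for are pulled into Python.
totals = conn.execute(
    "SELECT dept, SUM(salary) FROM staff GROUP BY dept ORDER BY dept"
).fetchall()
print(totals)
conn.close()
```

The receiving side only needs sqlite3.connect on the file and can query it without loading the whole table.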

accipter
Sep 12, 2003

Thermopyle posted:

I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

In my field of strong motion seismology and earthquake engineering, CSV textfiles (and Excel spreadsheets) are the standard for all sorts of data because people can parse them with FORTRAN. That is starting to shift as more people are coming into the field with knowledge of more modern programming languages.

Cingulate
Oct 23, 2012

by Fluffdaddy
Complaining about CSVs seems to me a bit like complaining about Angela Merkel. Sure, there's much to complain about, but there's this elephant in the room (Excel and proprietary stuff/Trump) that seems a much more pressing issue to most people.

Beef
Jul 26, 2004
Data format chat!

I'm seeing quite a bit of sqlite as well, python has some great support for it that makes it relatively transparent. You're still pulling in an entire db though, the overhead is stupid if you just want to load a simple table. HDF5 is typically a sign that the data scientist knows what he or she is doing.


accipter posted:

In my field of strong motion seismology and earthquake engineering, CSV textfiles (and Excel spreadsheets) are the standard for all sorts of data because people can parse them with FORTRAN. That is starting to shift as more people are coming into the field with knowledge of more modern programming languages.

The field of bioinformatics is similarly rife with ASCII-based file formats (SAM, FASTQ, ...). Two big reasons. First, Perl programmers and command-line heavy culture means that they flip their poo poo if they cannot 'head' or pipe a file. Second, and this is the big one, the gray-eminence tool developers that founded the field way back are absolutely averse to introducing dependencies. Any data format they cannot implement in C or Perl themselves easily is disregarded instantly (e.g. HDF5). For instance, a super-complicated gene-sequencing tool (bwa) only depends on gzip.

Yes folks, the human genome is stored as a zipped ASCII file.

QuarkJets
Sep 8, 2008

Nippashish posted:

And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.

Yeah, and that's why CSV is going to stick around forever. On the bright side I haven't used a language that doesn't have a really well-developed HDF5 library. For Python, h5py comes with all of the package managers by default and lets you treat everything like either a dict or a numpy array. And since it's all C under the hood and you're not reading a bunch of ASCII characters out of a text file, and because you only load into memory the arrays (or parts of arrays) that you asked for, it's fast as gently caress

Thermopyle posted:

I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

I've tried this, the performance was not great on either end so I went back to using HDF5

(I use a MySQL database for all sorts of stuff but HDF5 works better for larger arrays; and in a few weird cases I have used dark magicks ODBC to access data in an HDF5 file using an SQL query!)

QuarkJets fucked around with this message at 14:15 on Jan 25, 2017

LochNessMonster
Feb 3, 2005

I need about three fitty


I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve.

The HTML page contains something like this:

HTML code:
<div class="out-of-stock" qtlid="12345">
    Currently not in stock
    <br><span style="display: none;" qtlid="67890">
        Last in stock: {value}.
    </span>
</div>
When I visit the page in a browser I see a date / time where the HTML code says {value}. I assume this is something that should be filled in by JavaScript. Looking at the source I probably know (the path to) the script which does this.

Is there an easy way to let my python app run the specific JavaScript so I can scrape the date/time?

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

LochNessMonster posted:

I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve.

The HTML page contains something like this:

HTML code:
<div class="out-of-stock" qtlid="12345">
    Currently not in stock
    <br><span style="display: none;" qtlid="67890">
        Last in stock: {value}.
    </span>
</div>
When I visit the page in a browser I see a date / time where the HTML code says {value}. I assume this is something that should be filled in by JavaScript. Looking at the source I probably know (the path to) the script which does this.

Is there an easy way to let my python app run the specific JavaScript so I can scrape the date/time?

The best way to do this is to look at that javascript and see what it is doing to get the date/time and then do that with python. So, maybe it's doing an AJAX request to some url, and you can do this with requests in python. This often fails because you've got to have specific cookies or something else the page sets up in its environment and it's a bitch to reverse engineer.

If it's too complicated, you can try PyExecJS, but that often fails because it's not running in a browser environment. Usually what I do when I get to this point is use PhantomJS with Selenium to do what I need to do...that usually means throwing away all your BS4 and requests code and doing it all in PhantomJS.

LochNessMonster
Feb 3, 2005

I need about three fitty


I was hoping there'd be an easier solution, but since it's really just a nice to have I'm just gonna let this go.

Maybe I'll give it a try later on.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Thermopyle posted:

The best way to do this is to look at that javascript and see what it is doing to get the date/time and then do that with python. So, maybe it's doing an AJAX request to some url, and you can do this with requests in python. This often fails because you've got to have specific cookies or something else the page sets up in its environment and it's a bitch to reverse engineer.

If it's too complicated, you can try PyExecJS, but that often fails because it's not running in a browser environment. Usually what I do when I get to this point is use PhantomJS with Selenium to do what I need to do...that usually means throwing away all your BS4 and requests code and doing it all in PhantomJS.

You can also have Phantom/Slimer save the rendered DOM as HTML to disk pretty easily if there's no complicated login process to get to it.
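Once the rendered DOM is on disk, pulling the value out is ordinary HTML parsing. A minimal sketch, using the standard library's html.parser in place of BS4 to stay dependency-free; the rendered markup and the timestamp in it are invented for illustration:

```python
from html.parser import HTMLParser

# Hypothetical rendered markup, i.e. what the saved file might contain once
# the page's JavaScript has replaced {value} with a real timestamp.
rendered = """
<div class="out-of-stock" qtlid="12345">
    Currently not in stock
    <br><span style="display: none;" qtlid="67890">
        Last in stock: 2017-01-20 14:30.
    </span>
</div>
"""


class SpanText(HTMLParser):
    """Collect the text inside every <span> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <span> tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


parser = SpanText()
parser.feed(rendered)
print(parser.chunks)
```

Existing BS4 code should work the same way on the saved file, since by that point it is just static HTML.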

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Munkeymon posted:

You can also have Phantom/Slimer save the rendered DOM as HTML to disk pretty easily if there's no complicated login process to get to it.

Oh yeah, I forgot about this...which is funny because just last week I did it.


In all honesty, if a page has any javascript doing fetching or anything I just always go to using Phantom because it's just easier than trying to figure out wtf the page is actually doing. Of course, that comes with the downside of it being more resource-intensive, but that usually doesn't matter too much for me.

LochNessMonster
Feb 3, 2005

I need about three fitty


Thanks for the additional info. Phantom and Slimer appear to be JS. I'm still getting to know my way around Python so I'll just stick with that for now.

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

You can use Selenium (in Python) to mess around with a PhantomJS webdriver, and use most of your BS4 parsing code to pull stuff out of the resulting HTML. You're not actually touching any JavaScript (well unless you're messing around with PhantomJS's broken cookies like what happened when I tried it)

You can just let a web browser handle it too, put in whatever delays to let the page load and the JS mess around with the contents, parse the results. Unfortunately that seems like it's less simple than it used to be, GeckoDriver paths and what the hell ever, but it's definitely doable within Python once it's set up

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

LochNessMonster posted:

Thanks for the additional info. Phantom and Slimer appear to be JS. I'm still getting to know my way around Python so I'll just stick with that for now.

As baka kaba says and as I mentioned in my first post replying to you, you use Selenium to control Phantom. No need to use JS at all unless you need to do some more advanced stuff.

I actually kind of enjoy scraping with Phantom+Selenium in some ways. I've got a project now where I'm scraping a very JS-heavy site and I've created a class where I can do stuff like:

Python code:
>>> import datetime
>>> page = PhantomAndSeleniumBackedPageAbstraction()
>>> page.fill_form(foo=1, bar=datetime.datetime.now())
>>> page.submit_form()
>>> print(page.form_results)
Where the class instance takes care of abstracting away a lot of messy poo poo you've got to deal with on this particular page...including running JS in Phantom because this page is impossible and stupid and I can't do exactly what I want from Selenium itself.

Anyway, sounds like you don't need to go down this path yet, but I'm just letting you know.
