|
I may not be understanding the problem, but docx files are just archives. You should be able to extract one with any archive program, then read the xml file that contains the text content. Phone posting, so I can't test.
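Since a .docx is just a zip archive, the idea above can be sketched in a few lines. This builds a tiny stand-in docx in memory (a real one has more parts, but the body text lives in word/document.xml) and pulls the text out with a crude regex:

```python
import zipfile, io, re

# Build a minimal stand-in .docx in memory so the example is self-contained;
# a real file has more members, but word/document.xml holds the body text.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml",
               "<w:document><w:body><w:t>Hello docx</w:t></w:body></w:document>")

# Any zip reader can open it and read the xml out.
with zipfile.ZipFile(buf) as z:
    xml = z.read("word/document.xml").decode("utf-8")

# Crude extraction: text runs live in <w:t> elements.
text = "".join(re.findall(r"<w:t>(.*?)</w:t>", xml))
```

For real documents you'd want a proper XML parser (or python-docx) rather than a regex, but this is enough to see the structure.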
|
# ? Jan 21, 2017 20:01 |
|
A bit of a vague question here, mainly cause I've never really understood OOP. When you're parsing CSVs, do you ever create objects for the data you're working with, or just use arrays/dicts?
|
# ? Jan 22, 2017 03:32 |
|
Hughmoris posted: A bit of a vague question here, mainly cause I've never really understood OOP.

If I'm working with CSVs I use pandas unless I have a really, really compelling reason not to.
|
# ? Jan 22, 2017 03:36 |
|
Nippashish posted: If I'm working with CSVs I use pandas unless I have a really, really compelling reason not to.

My problem is that I don't have a great grasp of Pandas, and a lot of what I'm doing is parsing data from 2 CSVs and trying to join/merge certain parts.
|
# ? Jan 22, 2017 04:10 |
CSVs are a great use case for namedtuples, available in collections. In fact, the csv module has an option for exactly that (I made the mistake of rolling my own before fully reading the csv docs). To be honest, for most cases there's nothing wrong with using a dictionary. I like to start with a dict or namedtuple and refactor to a class if things start getting complicated (think nested dictionaries or functions that seem more like methods).
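A minimal sketch of the namedtuple approach (the column names and data here are made up for illustration):

```python
import csv, io
from collections import namedtuple

# Hypothetical two-column csv; in practice you'd open a file instead.
data = "name,age\nanna,31\nbob,27\n"
reader = csv.reader(io.StringIO(data))

# The header row becomes the namedtuple's field names.
Row = namedtuple("Row", next(reader))
rows = [Row(*line) for line in reader]

rows[0].name  # access columns by name instead of index
```

Note the values stay strings; the csv module does no type conversion, so you still cast numbers yourself.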
|
|
# ? Jan 22, 2017 04:16 |
|
While there's nothing wrong with not using pandas for working with tabular data, if it's something you plan to do more than, say, 5 times in your life, then it's worth investing the effort to learn how to do your thing in pandas. Pandas is very powerful, and while it makes some trivial things more complicated (because you need to learn how to use it), it makes so many other things into trivial operations that it's worth the initial investment to learn it.
|
# ? Jan 22, 2017 04:20 |
A csv-data-handling problem was mentioned earlier in the thread here. There's a number of solutions following, including a slick pandas one and a namedtuples solution I wrote. Hope these give you some ideas.
Eela6 fucked around with this message at 04:28 on Jan 22, 2017 |
|
# ? Jan 22, 2017 04:24 |
|
Eela6 posted: A csv-data-handling problem was mentioned earlier in the thread here. There's a number of solutions following, including a slick pandas one and a namedtuples solution I wrote. Hope these give you some ideas.

Haha yep, I used your code to solve that problem and it worked beautifully. Been using those concepts to solve other projects I've tackled as well. Thanks!
|
# ? Jan 22, 2017 04:30 |
|
If you're trying to join / merge tables, you definitely should be using Pandas.
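The join/merge case in pandas is a one-liner. A sketch with hypothetical tables (in practice each would come from `pd.read_csv`):

```python
import pandas as pd

# Stand-ins for two CSVs sharing a key column; names are made up.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Inner join on the shared key: keeps only ids present in both tables.
merged = left.merge(right, on="id", how="inner")
```

Swapping `how` for "left", "right", or "outer" covers the other join flavors.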
|
# ? Jan 22, 2017 19:08 |
Hughmoris posted: Haha yep, I used your code to solve that problem and it worked beautifully. Been using those concepts to solve other projects I've tackled as well. Thanks!

I'm glad it helped! It makes me feel really good.
|
|
# ? Jan 22, 2017 22:11 |
|
KernelSlanders posted: If you're trying to join / merge tables, you definitely should be using Pandas.

Just wanting to jump on the bandwagon here and echo that pandas is really great and powerful, and I can't really think of any compelling reason why you would want to work with csv files in python and not use it in some form.
|
# ? Jan 23, 2017 11:41 |
|
On my Ubuntu server, whenever I try to install an R package, I get compile errors:

quote: gcc: error: unrecognized command line option ‘-fstack-protector-strong’

It seems that this is because my default glib/libc comes from Anaconda, and Anaconda's versions are very old. What's the best way of dealing with this? I use R very rarely, but Python a lot.
|
# ? Jan 23, 2017 11:59 |
|
Try manually installing gcc to get a more recent one: code:
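A hedged sketch of what that could look like (package names are assumptions and vary by conda version; check what your channel actually provides before running this):

```shell
# Option 1: install a newer gcc into the Anaconda environment
# (gcc_linux-64 is the conda-forge compiler package name)
conda install -c conda-forge gcc_linux-64

# Option 2: point R at the system toolchain instead, via ~/.R/Makevars:
#   CC=/usr/bin/gcc
#   CXX=/usr/bin/g++
```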
|
# ? Jan 23, 2017 16:47 |
|
vodkat posted: Just wanting to jump on the band wagon here and echo that pandas is really great and powerful and I can't really think of any compelling reason why you would want to work with csv files in python and not use it in some form.

If you're writing something automated and need the extra performance, then it can be worth it to skip using Pandas. That's the only reason I can think of, though; Pandas is great for interactive analysis.

The real answer, though, is that you shouldn't store non-trivial amounts of data in CSV.
|
# ? Jan 23, 2017 22:50 |
|
Where does one even get CSV-formatted data nowadays? Excel exports?
|
# ? Jan 23, 2017 22:53 |
|
The world runs on Excel. If I can even convince people to export Excel to CSV before uploading it I consider it a victory.
|
# ? Jan 23, 2017 22:59 |
|
Thermopyle posted: Where does one even get CSV-formatted data nowadays? Excel exports?

"See attached report"

Outside of that, in my experience, it's still a common data exchange format because it's so widely supported and 'easy', meaning the pain comes in dribs and drabs forever instead of being in the up-front setup effort.
|
# ? Jan 23, 2017 23:00 |
|
Thermopyle posted: Where does one even get CSV-formatted data nowadays? Excel exports?

I write all my behavioral experimental result files in what amounts to CSV format, what'd be wrong with that?
|
# ? Jan 23, 2017 23:05 |
|
Cingulate posted: I write all my behavioral experimental result files in what amounts to CSV format, what'd be wrong with that?

I don't know that there's anything wrong with it other than the pain that comes from dealing with CSV... but every data interchange format has some sort of pain. I was just asking because I'm curious... it's been a decade since I've had to consume data in CSV.
|
# ? Jan 23, 2017 23:17 |
|
I'm seeing plenty of CSV. I am looking at a ML pipeline at the moment with about 20GB of CSV, compressed of course. Intel DAAL supports only CSV input files out of the box, TensorFlow has a protobuf and csv reader. It's also the preferred way to read in tables/frames in R.

That said, there is plenty wrong with CSV. Stop, just please, stop. There is no standard or even RFC, everyone does whatever the gently caress they want, most CSVs are not actually comma-separated, and locales can gently caress up data if you do not double-check the result manually.

Code horror anecdote: My alma mater had a student exam score database problem because it was exported/imported in CSV. Guess what happens in a locale where ',' is the decimal separator, such as most of Europe.

edit: whoops, looks like my info is outdated: https://www.ietf.org/rfc/rfc4180.txt
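The locale pitfall above is easy to demonstrate. A sketch with a made-up German-style export, where the delimiter is a semicolon and decimals use commas:

```python
import csv, io

# Hypothetical European-locale export: semicolon-delimited, decimal commas.
data = "name;score\nAnna;1,5\n"
rows = list(csv.DictReader(io.StringIO(data), delimiter=";"))

raw = rows[0]["score"]          # "1,5" — float(raw) would raise ValueError
score = float(raw.replace(",", "."))
```

If the same file were naively split on commas instead, "1,5" would shatter into two fields, which is exactly the exam-score disaster described above.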
|
# ? Jan 23, 2017 23:50 |
|
Thermopyle posted: Where does one even get CSV-formatted data nowadays? Excel exports?

Database dumps, and the output of various processes destined to be loaded into redis or other database tables, because our VP of engineering wants the data to be "human readable".
|
# ? Jan 24, 2017 00:41 |
|
Thermopyle posted: I don't know that there's anything wrong with it other than the pain that comes from dealing with CSV... but every data interchange format has some sort of pain. I was just asking because I'm curious... it's been a decade since I've had to consume data in CSV.

Most things academic, statistical or data sciency will prefer, or at the very least give you the option of, a csv.
|
# ? Jan 24, 2017 00:50 |
|
Also half the size of the equivalent json. There are definitely better arbitrary serialization formats (avro) but you can do a lot worse. I'm curious what format Thermopyle is consuming data in.
|
# ? Jan 24, 2017 04:01 |
|
HDF basically obsoletes the CSV format. If you're using tons and tons of CSVs, consider switching to HDF5. The benefits are innumerable; your files will be way smaller, way faster to read and write, and way more organized (an HDF5 file is like a little file system for your data that you can organize however you want, and compression is transparent and effortless). The downside is that you can't just poo poo out a ton of numbers into a text file
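A minimal sketch of the "little file system" idea with h5py (the group and dataset names here are invented for illustration):

```python
import numpy as np
import h5py

# Write a compressed array under a nested group path of our choosing;
# compression is a per-dataset flag, nothing else changes.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("experiments/run1/samples",
                     data=np.arange(1000.0), compression="gzip")

# Reading a slice only loads that slice, not the whole dataset.
with h5py.File("demo.h5", "r") as f:
    chunk = f["experiments/run1/samples"][:10]
```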
|
# ? Jan 24, 2017 05:13 |
|
KernelSlanders posted: I'm curious what format Thermopyle is consuming data in.

Almost always pulled from a database.
|
# ? Jan 24, 2017 07:46 |
|
QuarkJets posted: The downside is that you can't just poo poo out a ton of numbers into a text file

And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.
|
# ? Jan 24, 2017 09:42 |
My approach is this: If my data is a reasonably-sized (named)tuple, CSVs are fine. Once the data becomes complicated enough to need another data structure, CSVs are no longer sufficient.
|
|
# ? Jan 24, 2017 19:28 |
|
I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...
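For what it's worth, the sqlite-as-interchange idea needs nothing outside the standard library. A sketch (table and column names are made up; use a filename instead of ":memory:" to get a file you can actually pass around):

```python
import sqlite3

# ":memory:" keeps the example self-contained; a real interchange file
# would use a path like "data.sqlite".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, score REAL)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("anna", 1.5), ("bob", 2.0)])
con.commit()

# The receiver gets full SQL, not just a dumb row stream.
total = con.execute("SELECT SUM(score) FROM scores").fetchone()[0]
```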
|
# ? Jan 24, 2017 20:54 |
|
Thermopyle posted: I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

In my field of strong motion seismology and earthquake engineering, CSV textfiles (and Excel spreadsheets) are the standard for all sorts of data because people can parse them with FORTRAN. That is starting to shift as more people are coming into the field with knowledge of more modern programming languages.
|
# ? Jan 24, 2017 21:11 |
|
Complaining about CSVs seems to me a bit like complaining about Angela Merkel. Sure, there's much to complain about, but there's this elephant in the room (Excel and proprietary stuff/Trump) that seems a much more pressing issue to most people.
|
# ? Jan 24, 2017 21:16 |
|
Data format chat! I'm seeing quite a bit of sqlite as well; python has some great support for it that makes it relatively transparent. You're still pulling in an entire db though; the overhead is stupid if you just want to load a simple table. HDF5 is typically a sign that the data scientist knows what he or she is doing.

accipter posted: In my field of strong motion seismology and earthquake engineering, CSV textfiles (and Excel spreadsheets) are the standard for all sorts of data because people can parse them with FORTRAN. That is starting to shift as more people are coming into the field with knowledge of more modern programming languages.

The field of bioinformatics is similarly rife with ASCII-based file formats (SAM, FASTQ, ...). Two big reasons. First, Perl programmers and a command-line-heavy culture mean that they flip their poo poo if they cannot 'head' or pipe a file. Second, and this is the big one, the gray-eminence tool developers that founded the field way back are absolutely averse to introducing dependencies. Any data format they cannot easily implement in C or Perl themselves is disregarded instantly (e.g. HDF5). For instance, a super-complicated gene-sequencing tool (bwa) only depends on gzip. Yes folks, the human genome is stored as a zipped ASCII file.
|
# ? Jan 25, 2017 11:19 |
|
Nippashish posted: And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.

Yeah, and that's why CSV is going to stick around forever. On the bright side, I haven't used a language that doesn't have a really well-developed HDF5 library. For Python, h5py comes with all of the package managers by default and lets you treat everything like either a dict or a numpy array. And since it's all C under the hood and you're not reading a bunch of ASCII characters out of a text file, and because you only load into memory the arrays (or parts of arrays) that you asked for, it's fast as gently caress.

Thermopyle posted: I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

I've tried this, the performance was not great on either end so I went back to using HDF5 (I use a MySQL database for all sorts of stuff but HDF5 works better for larger arrays; and in a few weird cases I have used

QuarkJets fucked around with this message at 14:15 on Jan 25, 2017 |
# ? Jan 25, 2017 14:12 |
|
I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve. The html page contains something like this:

HTML code:
Is there an easy way to let my python app run the specific JavaScript so I can scrape the date/time?
|
# ? Jan 25, 2017 20:01 |
|
LochNessMonster posted: I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve.

The best way to do this is to look at that javascript and see what it's doing to get the date/time, and then do that with python. So, maybe it's doing an AJAX request to some url, and you can do this with requests in python. This often fails because you've got to have specific cookies or something else the page sets up in its environment, and it's a bitch to reverse engineer. If it's too complicated, you can try PyExecJS, but that often fails because it's not running in a browser environment. Usually what I do when I get to this point is use PhantomJS with Selenium to do what I need to do... that usually means throwing away all your BS4 and requests code and doing it all in PhantomJS.
|
# ? Jan 25, 2017 20:16 |
|
I was hoping there'd be an easier solution, but since it's really just a nice to have I'm just gonna let this go. Maybe I'll give it a try later on.
|
# ? Jan 25, 2017 23:47 |
|
Thermopyle posted: The best way to do this is look at that javascript and see what it doing to get the date/time and then do that with python. So, maybe it's doing an AJAX request to some url, and you can do this with requests in python. This often fails because you've got to have specific cookies or something else the page sets up in its environment and it's a bitch to reverse engineer.

You can also have Phantom/Slimer save the rendered DOM as HTML to disk pretty easily if there's no complicated login process to get to it.
|
# ? Jan 26, 2017 18:51 |
|
Munkeymon posted: You can also have Phantom/Slimer save the rendered DOM as HTML to disk pretty easily if there's no complicated login process to get to it.

Oh yeah, I forgot about this... which is funny because just last week I did it. In all honesty, if a page has any javascript doing fetching or anything, I just always go to using Phantom because it's just easier than trying to figure out wtf the page is actually doing. Of course, that comes with the downside of it being more resource-intensive, but that usually doesn't matter too much for me.
|
# ? Jan 26, 2017 19:13 |
|
Thanks for the additional info. Phantom and Slimer appear to be JS. I'm still getting to know my way around Python so I'll just stick with that for now.
|
# ? Jan 26, 2017 20:55 |
|
You can use Selenium (in Python) to mess around with a PhantomJS webdriver, and use most of your BS4 parsing code to pull stuff out of the resulting HTML. You're not actually touching any JavaScript (well, unless you're messing around with PhantomJS's broken cookies like what happened when I tried it).

You can just let a web browser handle it too: put in whatever delays to let the page load and the JS mess around with the contents, then parse the results. Unfortunately that seems like it's less simple than it used to be, GeckoDriver paths and what the hell ever, but it's definitely doable within Python once it's set up.
|
# ? Jan 26, 2017 21:05 |
|
LochNessMonster posted: Thanks for the additional info. Phantom and Slimer appear to be JS. I'm still getting to know my way around Python so I'll just stick with that for now.

As baka kaba says, and as I mentioned in my first post replying to you, you use Selenium to control Phantom. No need to use JS at all unless you need to do some more advanced stuff. I actually kind of enjoy scraping with Phantom+Selenium in some ways. I've got a project now where I'm scraping a very JS-heavy site and I've created classes where I can do stuff like:

Python code:
Anyway, sounds like you don't need to go down this path yet, but I'm just letting you know.
|
# ? Jan 26, 2017 21:37 |