|
I may not be understanding the problem, but docx files are just archives. You should be able to extract one with any archive program, then read the xml file that contains the text content. Phone posting, so I can't test.
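Since a .docx is just a zip archive, the idea above can be sketched in a few lines. This builds a tiny stand-in docx in memory (a real one has more parts, but the body text lives in word/document.xml) and pulls the text out with a crude regex:

```python
import zipfile, io, re

# Build a minimal stand-in .docx in memory so the example is self-contained;
# a real file has more members, but word/document.xml holds the body text.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml",
               "<w:document><w:body><w:t>Hello docx</w:t></w:body></w:document>")

# Any zip reader can open it and read the xml out.
with zipfile.ZipFile(buf) as z:
    xml = z.read("word/document.xml").decode("utf-8")

# Crude extraction: text runs live in <w:t> elements.
text = "".join(re.findall(r"<w:t>(.*?)</w:t>", xml))
```

For real documents you'd want a proper XML parser (or python-docx) rather than a regex, but this is enough to see the structure.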
|
# ? Jan 21, 2017 20:01 |
|
A bit of a vague question here, mainly cause I've never really understood OOP. When you're parsing CSVs, do you ever create objects for the data you're working with, or just use arrays/dicts?
|
# ? Jan 22, 2017 03:32 |
|
Hughmoris posted: A bit of a vague question here, mainly cause I've never really understood OOP.

If I'm working with CSVs I use pandas unless I have a really, really compelling reason not to.
|
# ? Jan 22, 2017 03:36 |
|
Nippashish posted: If I'm working with CSVs I use pandas unless I have a really, really compelling reason not to.

My problem is that I don't have a great grasp of Pandas, and a lot of what I'm doing is parsing data from 2 CSVs and trying to join/merge certain parts.
|
# ? Jan 22, 2017 04:10 |
CSVs are a great use case for namedtuples, available in collections. In fact, the csv module has an option for exactly that (I made the mistake of rolling my own before fully reading the csv docs). To be honest, for most cases there's nothing wrong with using a dictionary. I like to start with a dict or namedtuple and refactor to a class if things start getting complicated (think nested dictionaries or functions that seem more like methods).
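A minimal sketch of the namedtuple approach (the column names and data here are made up for illustration):

```python
import csv, io
from collections import namedtuple

# Hypothetical two-column csv; in practice you'd open a file instead.
data = "name,age\nanna,31\nbob,27\n"
reader = csv.reader(io.StringIO(data))

# The header row becomes the namedtuple's field names.
Row = namedtuple("Row", next(reader))
rows = [Row(*line) for line in reader]

rows[0].name  # access columns by name instead of index
```

Note the values stay strings; the csv module does no type conversion, so you still cast numbers yourself.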
|
|
# ? Jan 22, 2017 04:16 |
|
While there's nothing wrong with not using pandas for working with tabular data, if it's something you plan to do more than, say, 5 times in your life, then it's worth investing the effort to learn how to do your thing in pandas. Pandas is very powerful, and while it makes some trivial things more complicated (because you need to learn how to use it), it makes so many other things into trivial operations that it's worth the initial investment to learn it.
|
# ? Jan 22, 2017 04:20 |
A csv-data-handling problem was mentioned earlier in the thread here. There's a number of solutions following, including a slick pandas one and a namedtuples solution I wrote. Hope these give you some ideas.
Eela6 fucked around with this message at 04:28 on Jan 22, 2017 |
|
# ? Jan 22, 2017 04:24 |
|
Eela6 posted: A csv-data-handling problem was mentioned earlier in the thread here. There's a number of solutions following, including a slick pandas one and a namedtuples solution I wrote. Hope these give you some ideas.

Haha yep, I used your code to solve that problem and it worked beautifully. Been using those concepts to solve other projects I've tackled as well. Thanks!
|
# ? Jan 22, 2017 04:30 |
|
If you're trying to join / merge tables, you definitely should be using Pandas.
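The join/merge case in pandas is a one-liner. A sketch with hypothetical tables (in practice each would come from `pd.read_csv`):

```python
import pandas as pd

# Stand-ins for two CSVs sharing a key column; names are made up.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Inner join on the shared key: keeps only ids present in both tables.
merged = left.merge(right, on="id", how="inner")
```

Swapping `how` for "left", "right", or "outer" covers the other join flavors.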
|
# ? Jan 22, 2017 19:08 |
Hughmoris posted: Haha yep, I used your code to solve that problem and it worked beautifully. Been using those concepts to solve other projects I've tackled as well. Thanks!

I'm glad it helped! It makes me feel really good.
|
|
# ? Jan 22, 2017 22:11 |
|
KernelSlanders posted: If you're trying to join / merge tables, you definitely should be using Pandas.

Just wanting to jump on the bandwagon here and echo that pandas is really great and powerful, and I can't really think of any compelling reason why you would want to work with csv files in python and not use it in some form.
|
# ? Jan 23, 2017 11:41 |
|
On my Ubuntu server, whenever I try to install an R package, I get compile errors:

quote: gcc: error: unrecognized command line option ‘-fstack-protector-strong’

It seems that this is because my default glib/libc comes from Anaconda, and Anaconda's versions are very old. What's the best way of dealing with this? I use R very rarely, but Python a lot.
|
# ? Jan 23, 2017 11:59 |
|
Try manually installing gcc to get a more recent one: code:
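A hedged sketch of what that could look like (package names are assumptions and vary by conda version; check what your channel actually provides before running this):

```shell
# Option 1: install a newer gcc into the Anaconda environment
# (gcc_linux-64 is the conda-forge compiler package name)
conda install -c conda-forge gcc_linux-64

# Option 2: point R at the system toolchain instead, via ~/.R/Makevars:
#   CC=/usr/bin/gcc
#   CXX=/usr/bin/g++
```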
|
# ? Jan 23, 2017 16:47 |
|
vodkat posted: Just wanting to jump on the band wagon here and echo that pandas is really great and powerful and I can't really think of any compelling reason why you would want to work with csv files in python and not use it in some form.

If you're writing something automated and need the extra performance, then it can be worth it to skip using Pandas. That's the only reason I can think of, though; Pandas is great for interactive analysis.

The real answer, though, is that you shouldn't store non-trivial amounts of data in CSV.
|
# ? Jan 23, 2017 22:50 |
|
Where does one even get CSV-formatted data nowadays? Excel exports?
|
# ? Jan 23, 2017 22:53 |
|
The world runs on Excel. If I can even convince people to export Excel to CSV before uploading it I consider it a victory.
|
# ? Jan 23, 2017 22:59 |
|
Thermopyle posted: Where does one even get CSV-formatted data nowadays? Excel exports?

"See attached report"

Outside of that, in my experience, it's still a common data exchange format because it's so widely supported and 'easy', meaning the pain comes in dribs and drabs forever instead of being in the up-front setup effort.
|
# ? Jan 23, 2017 23:00 |
|
Thermopyle posted: Where does one even get CSV-formatted data nowadays? Excel exports?

I write all my behavioral experimental result files in what amounts to CSV format, what'd be wrong with that?
|
# ? Jan 23, 2017 23:05 |
|
Cingulate posted: I write all my behavioral experimental result files in what amounts to CSV format, what'd be wrong with that?

I don't know that there's anything wrong with it other than the pain that comes from dealing with CSV... but every data interchange format has some sort of pain. I was just asking because I'm curious... it's been a decade since I've had to consume data in CSV.
|
# ? Jan 23, 2017 23:17 |
|
I'm seeing plenty of CSV. I am looking at a ML pipeline at the moment with about 20GB of CSV, compressed of course. Intel DAAL supports only CSV input files out of the box, TensorFlow has a protobuf and csv reader. It's also the preferred way to read in tables/frames in R.

That said, there is plenty wrong with CSV. Stop, just please, stop. There is no standard or even RFC, everyone does whatever the gently caress they want, most CSVs are not actually comma-separated, and locales can gently caress up data if you do not double-check the result manually.

Code horror anecdote: My alma mater had a student exam score database problem because it was exported/imported in CSV. Guess what happens in a locale where ',' is the decimal separator, such as most of Europe.

edit: whoops, looks like my info is outdated: https://www.ietf.org/rfc/rfc4180.txt
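The locale pitfall above is easy to demonstrate. A sketch with a made-up German-style export, where the delimiter is a semicolon and decimals use commas:

```python
import csv, io

# Hypothetical European-locale export: semicolon-delimited, decimal commas.
data = "name;score\nAnna;1,5\n"
rows = list(csv.DictReader(io.StringIO(data), delimiter=";"))

raw = rows[0]["score"]          # "1,5" — float(raw) would raise ValueError
score = float(raw.replace(",", "."))
```

If the same file were naively split on commas instead, "1,5" would shatter into two fields, which is exactly the exam-score disaster described above.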
|
# ? Jan 23, 2017 23:50 |
|
Thermopyle posted: Where does one even get CSV-formatted data nowadays? Excel exports?

Database dumps, and the output of various processes destined to be loaded into redis or other database tables, because our VP of engineering wants the data to be "human readable".
|
# ? Jan 24, 2017 00:41 |
|
Thermopyle posted: I don't know that there's anything wrong with it other than the pain that comes from dealing with CSV... but every data interchange format has some sort of pain. I was just asking because I'm curious... it's been a decade since I've had to consume data in CSV.

Most things academic, statistical or data sciency will prefer, or at the very least give you the option of, a csv.
|
# ? Jan 24, 2017 00:50 |
|
Also half the size of the equivalent json. There are definitely better arbitrary serialization formats (avro) but you can do a lot worse. I'm curious what format Thermopyle is consuming data in.
|
# ? Jan 24, 2017 04:01 |
|
HDF basically obsoletes the CSV format. If you're using tons and tons of CSVs, consider switching to HDF5. The benefits are innumerable; your files will be way smaller, way faster to read and write, and way more organized (an HDF5 file is like a little file system for your data that you can organize however you want, and compression is transparent and effortless). The downside is that you can't just poo poo out a ton of numbers into a text file
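A minimal sketch of the "little file system" idea with h5py (the group and dataset names here are invented for illustration):

```python
import numpy as np
import h5py

# Write a compressed array under a nested group path of our choosing;
# compression is a per-dataset flag, nothing else changes.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("experiments/run1/samples",
                     data=np.arange(1000.0), compression="gzip")

# Reading a slice only loads that slice, not the whole dataset.
with h5py.File("demo.h5", "r") as f:
    chunk = f["experiments/run1/samples"][:10]
```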
|
# ? Jan 24, 2017 05:13 |
|
KernelSlanders posted: I'm curious what format Thermopyle is consuming data in.

Almost always pulled from a database.
|
# ? Jan 24, 2017 07:46 |
|
QuarkJets posted: The downside is that you can't just poo poo out a ton of numbers into a text file

And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.
|
# ? Jan 24, 2017 09:42 |
My approach is this: If my data is a reasonably-sized (named)tuple, CSVs are fine. Once the data becomes complicated enough to need another data structure, CSVs are no longer sufficient.
|
|
# ? Jan 24, 2017 19:28 |
|
I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...
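For what it's worth, the sqlite-as-interchange idea needs nothing outside the standard library. A sketch (table and column names are made up; use a filename instead of ":memory:" to get a file you can actually pass around):

```python
import sqlite3

# ":memory:" keeps the example self-contained; a real interchange file
# would use a path like "data.sqlite".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, score REAL)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("anna", 1.5), ("bob", 2.0)])
con.commit()

# The receiver gets full SQL, not just a dumb row stream.
total = con.execute("SELECT SUM(score) FROM scores").fetchone()[0]
```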
|
# ? Jan 24, 2017 20:54 |
|
Thermopyle posted: I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

In my field of strong motion seismology and earthquake engineering, CSV textfiles (and Excel spreadsheets) are the standard for all sorts of data because people can parse them with FORTRAN. That is starting to shift as more people are coming into the field with knowledge of more modern programming languages.
|
# ? Jan 24, 2017 21:11 |
|
Complaining about CSVs seems to me a bit like complaining about Angela Merkel. Sure, there's much to complain about, but there's this elephant in the room (Excel and proprietary stuff/Trump) that seems a much more pressing issue to most people.
|
# ? Jan 24, 2017 21:16 |
|
Data format chat! I'm seeing quite a bit of sqlite as well; python has some great support for it that makes it relatively transparent. You're still pulling in an entire db though; the overhead is stupid if you just want to load a simple table. HDF5 is typically a sign that the data scientist knows what he or she is doing.

accipter posted: In my field of strong motion seismology and earthquake engineering, CSV textfiles (and Excel spreadsheets) are the standard for all sorts of data because people can parse them with FORTRAN. That is starting to shift as more people are coming into the field with knowledge of more modern programming languages.

The field of bioinformatics is similarly rife with ASCII-based file formats (SAM, FASTQ, ...). Two big reasons. First, Perl programmers and a command-line-heavy culture mean that they flip their poo poo if they cannot 'head' or pipe a file. Second, and this is the big one, the gray-eminence tool developers that founded the field way back are absolutely averse to introducing dependencies. Any data format they cannot easily implement in C or Perl themselves is disregarded instantly (e.g. HDF5). For instance, a super-complicated gene-sequencing tool (bwa) only depends on gzip. Yes folks, the human genome is stored as a zipped ASCII file.
|
# ? Jan 25, 2017 11:19 |
|
Nippashish posted: And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.

Yeah, and that's why CSV is going to stick around forever. On the bright side, I haven't used a language that doesn't have a really well-developed HDF5 library. For Python, h5py comes with all of the package managers by default and lets you treat everything like either a dict or a numpy array. And since it's all C under the hood and you're not reading a bunch of ASCII characters out of a text file, and because you only load into memory the arrays (or parts of arrays) that you asked for, it's fast as gently caress.

Thermopyle posted: I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

I've tried this, the performance was not great on either end so I went back to using HDF5 (I use a MySQL database for all sorts of stuff but HDF5 works better for larger arrays; and in a few weird cases I have used

QuarkJets fucked around with this message at 14:15 on Jan 25, 2017 |
# ? Jan 25, 2017 14:12 |
|
I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve. The html page contains something like this:

HTML code:
Is there an easy way to let my python app run the specific JavaScript so I can scrape the date/time?
|
# ? Jan 25, 2017 20:01 |
|
LochNessMonster posted: I'm still playing with requests and bs4 and am running into an issue I'm not sure how to solve.

The best way to do this is to look at that javascript and see what it's doing to get the date/time, and then do that with python. So, maybe it's doing an AJAX request to some url, and you can do this with requests in python. This often fails because you've got to have specific cookies or something else the page sets up in its environment, and it's a bitch to reverse engineer. If it's too complicated, you can try PyExecJS, but that often fails because it's not running in a browser environment. Usually what I do when I get to this point is use PhantomJS with Selenium to do what I need to do... that usually means throwing away all your BS4 and requests code and doing it all in PhantomJS.
|
# ? Jan 25, 2017 20:16 |
|
I was hoping there'd be an easier solution, but since it's really just a nice to have I'm just gonna let this go. Maybe I'll give it a try later on.
|
# ? Jan 25, 2017 23:47 |
|
Thermopyle posted: The best way to do this is look at that javascript and see what it doing to get the date/time and then do that with python. So, maybe it's doing an AJAX request to some url, and you can do this with requests in python. This often fails because you've got to have specific cookies or something else the page sets up in its environment and it's a bitch to reverse engineer.

You can also have Phantom/Slimer save the rendered DOM as HTML to disk pretty easily if there's no complicated login process to get to it.
|
# ? Jan 26, 2017 18:51 |
|
Munkeymon posted: You can also have Phantom/Slimer save the rendered DOM as HTML to disk pretty easily if there's no complicated login process to get to it.

Oh yeah, I forgot about this... which is funny because just last week I did it. In all honesty, if a page has any javascript doing fetching or anything, I just always go to using Phantom because it's just easier than trying to figure out wtf the page is actually doing. Of course, that comes with the downside of it being more resource-intensive, but that usually doesn't matter too much for me.
|
# ? Jan 26, 2017 19:13 |
|
Thanks for the additional info. Phantom and Slimer appear to be JS. I'm still getting to know my way around Python so I'll just stick with that for now.
|
# ? Jan 26, 2017 20:55 |
|
You can use Selenium (in Python) to mess around with a PhantomJS webdriver, and use most of your BS4 parsing code to pull stuff out of the resulting HTML. You're not actually touching any JavaScript (well, unless you're messing around with PhantomJS's broken cookies like what happened when I tried it).

You can just let a web browser handle it too: put in whatever delays to let the page load and the JS mess around with the contents, then parse the results. Unfortunately that seems like it's less simple than it used to be, GeckoDriver paths and what the hell ever, but it's definitely doable within Python once it's set up.
|
# ? Jan 26, 2017 21:05 |
|
LochNessMonster posted: Thanks for the additional info. Phantom and Slimer appear to be JS. I'm still getting to know my way around Python so I'll just stick with that for now.

As baka kaba says, and as I mentioned in my first post replying to you, you use Selenium to control Phantom. No need to use JS at all unless you need to do some more advanced stuff. I actually kind of enjoy scraping with Phantom+Selenium in some ways. I've got a project now where I'm scraping a very JS-heavy site and I've created classes where I can do stuff like:

Python code:
Anyway, sounds like you don't need to go down this path yet, but I'm just letting you know.
|
# ? Jan 26, 2017 21:37 |