|
HDF5 basically obsoletes the CSV format. If you're using tons and tons of CSVs, consider switching to HDF5. The benefits are innumerable: your files will be way smaller, way faster to read and write, and way more organized (an HDF5 file is like a little file system for your data that you can organize however you want, and compression is transparent and effortless). The downside is that you can't just poo poo out a ton of numbers into a text file.
|
# ¿ Jan 24, 2017 05:13 |
|
|
|
Nippashish posted:
And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.

Yeah, and that's why CSV is going to stick around forever. On the bright side, I haven't used a language that doesn't have a really well-developed HDF5 library. For Python, h5py comes with all of the package managers by default and lets you treat everything like either a dict or a numpy array. And since it's all C under the hood, and you're not reading a bunch of ASCII characters out of a text file, and because you only load into memory the arrays (or parts of arrays) that you asked for, it's fast as gently caress.

Thermopyle posted:
I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

I've tried this; the performance was not great on either end, so I went back to using HDF5 (I use a MySQL database for all sorts of stuff, but HDF5 works better for larger arrays; and in a few weird cases I have used

QuarkJets fucked around with this message at 14:15 on Jan 25, 2017
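As a rough sketch of that h5py workflow (the file name, group, and dataset names here are all invented for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")
data = np.arange(1_000_000, dtype=np.float64)

# Groups act like dicts, datasets act like numpy arrays,
# and gzip compression is a single keyword argument
with h5py.File(path, "w") as f:
    grp = f.create_group("measurements")
    grp.create_dataset("signal", data=data, compression="gzip")

with h5py.File(path, "r") as f:
    # Slicing reads only the requested part of the array from disk
    chunk = f["measurements/signal"][:10]

print(chunk)
```

The slice-on-read part is what makes it fast for big arrays: you never pay to parse or load the 990,000 elements you didn't ask for.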
# ¿ Jan 25, 2017 14:12 |
|
I hate getting assigned to a project and discovering that they wrote their own half-baked daemon and gave it a special service account just to run a process that has to execute every 10 minutes
|
# ¿ Jan 30, 2017 23:51 |
|
Eela6 posted:
Thanks! I picked up those tricks from Luciano Ramalho's Fluent Python and Brett Slatkin's Effective Python, respectively.

"Functions should have as few arguments as possible" gets introduced on slide 41, but it probably belongs in the list of Function rules on slides 21-26. This is basically just another function rule, is it not? I think "needless identifiers" is a misnomer; _ is just a variable name like anything else. I understand that some people like to use _ to indicate "this variable is not used", but that's still an identifier. I'm not sure what to actually call it.
|
# ¿ Feb 1, 2017 01:26 |
|
Nippashish posted:
I propose we rename 'zip' to 'as_tuples_by_index' because 'zip' is unnecessarily cryptic, and fails to be self-explanatory unless you already know what it's supposed to mean. 

I think 'zip' is actually pretty simple to understand if you've ever seen or used a zipper before.
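For what it's worth, the zipper picture maps directly onto what zip does, interleaving elements pairwise by position (example data invented):

```python
names = ["ada", "grace", "linus"]
scores = [3, 1, 2]

# zip pairs up elements by index, like the two sides of a zipper meshing
pairs = list(zip(names, scores))
print(pairs)  # [('ada', 3), ('grace', 1), ('linus', 2)]
```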
|
# ¿ Feb 1, 2017 23:31 |
|
Nippashish posted:
zip is intuitive if you know what it means, but if you'd never encountered it before you might not know what it did just from the name. My argument is that _ is the same. 

But only one of those references something in the real world. The other is an arbitrarily-decided convention. I'm not saying that _ is a bad convention, but your analogy is really weak.
|
# ¿ Feb 2, 2017 00:37 |
|
It's also worth mentioning that _ is usually used as an alias for the gettext function, so by reusing it for throwaway values you're also committing the sin of shadowing a function with a variable. Granted, _ is a terrible function name; still, it's easy to get into trouble if you're working on code that uses this badly-named alias. Pylint regards any local variable with an underscore in front of it (such as _unused, _foo, _bar) as unused, which seems like it's the most Pythonic solution.
Note that _ isn't mentioned in PEP8 anywhere; I think it's the kind of convention that just kind of spread by word of mouth, and it's generally safe to ignore for those who want to ignore it.
|
# ¿ Feb 2, 2017 00:55 |
|
Thermopyle posted:
That's because it's a programming convention, not a Python convention. It's used in probably dozens of languages. 

Okay, you're exaggerating though, right? If you saw someone assign something to a variable named "unused", surely you wouldn't clutch your head in confusion; instead you'd probably think "oh, this variable must not be used by anything" regardless of your experience level.
|
# ¿ Feb 3, 2017 03:56 |
|
baka kaba posted:
Maybe it's a boolean that contains information about whether something is used? Or a collection of unused things from a pool? Like, is the value you're assigning part of an object's state, or something you're going to be ignoring? You can make an assumption, but how sure are you without further investigation? 

Neither of those arguments is working for me, as they both also apply against "_" and for "unused" respectively. The "_" character is even less descriptive than "unused", and so it could store literally anything. Case in point: _ is already aliased to the gettext function, and in the interpreter it maps to the value returned by the previous command. Even if the convention is to store unneeded data in "_", that doesn't necessarily make it a good idea; using a different convention might be better.

I've seen "unused" used to hold unused returns many times, so apparently that is a convention as well, although possibly it's just more commonly used in scientific computing than in other fields. But still, it's conventional, so your point about the positives of convention applies to it as well.

I'd like to re-raise the idea of always labeling returned variables descriptively, like you would have done anyway, but prefixing those names with "_" when they're unused. Pylint already supports this, it's more explicit, it's more readable, and it still carries the benefits of conventions. It seems like this is the best of all possible worlds, the only downside being that you have to type a few more characters, which seems pretty negligible.
|
# ¿ Feb 3, 2017 06:03 |
|
Eela6 posted:
The problem is that prefixing variables with a single underscore already means something by python convention. It signifies a "protected" variable that's internal to the module and not meant to be used by a public consumer of the function. 

That's actually consistent, though; just as you shouldn't use a variable marked as unused, you also shouldn't use a variable marked as "protected". So that's not a problem. You also shouldn't want to import an unused variable, so that's all the more reason to label unused variables with leading underscores! I'm sure that "_" is already captured in this rule, and switching to using "_" as a prefix character wouldn't change the behavior, which is good. So... no downsides so far except "it's not already the convention".
|
# ¿ Feb 3, 2017 06:50 |
|
But explicit is better than implicit, right? Doesn't _ go against that?
|
# ¿ Feb 3, 2017 10:14 |
|
baka kaba posted:
Flat is better than nested 

That's not what explicit means; there is no way that _ could be considered more explicit than a descriptive variable name. And again, any argument you make about conventions (which you're now calling "a tool") applies equally well to using _ as a prefix (which is already a convention/tool in pylint). I'm not sure why you're against using _ as a prefix for marking unused variables: it's already an in-use convention, it's in accordance with the other ways that underscores are usually used, and it's friendlier to people on the outside who may have never seen the convention before (such as C programmers).

QuarkJets fucked around with this message at 13:19 on Feb 3, 2017
# ¿ Feb 3, 2017 13:02 |
|
Nippashish posted:
Its almost like clarity of different conventions depends on which conventions you're used to seeing and using yourself. 

This. Even just in this thread we've seen several posters who had either not seen the "_ as unused" convention, had seen a similar but different convention, or who knew of the convention but don't use it because they think that a different convention is better. These all seem like okay states to be in, and I don't see anything wrong with someone using an alternative if they think that's better, especially when that alternative is already recognized in pylint.

We are talking about something that is usually inconsequential (a variable name) that contains something that is definitely inconsequential (an unused return value), so maybe it's just more important that a codebase is internally consistent than that it aligns with an unstated convention ported over from other languages?
|
# ¿ Feb 5, 2017 02:35 |
|
Thermopyle posted:
I think this is the first time this argument has been brought up against the "_" in this discussion which seems odd if it's the best argument or right-est argument. 

Pylint recognizes other conventions as being okay, why won't you?
|
# ¿ Feb 5, 2017 03:14 |
|
Thermopyle posted:
I'm not sure you understand what you highlighted in my quote means? 

In the text that I highlighted, you're subtly implying that using _ for unused variables is objectively better than other conventions. It's the "vim vs emacs" of variable naming; both options are fine, and if you're starting a new codebase then it's fine to use whatever you want.
|
# ¿ Feb 6, 2017 04:58 |
|
Fusion Restaurant posted:
What IDEs do people like for doing data stuff in Python? I've been just using Spyder or Jupyter notebooks pretty much because they were the default options w/ Anaconda. 

Data scientist checking in. PyCharm is not really made with data analysis in mind, but it is still a really great IDE. It's what I use when it's available and when I'm not being lazy/opening Vim for a quick thing.

Spyder is solid, but debugging in it is a real chore compared to PyCharm. I think it also doesn't have native version control and that its code refactoring tools are per-file rather than per-project, but I could be wrong; I haven't used it in a long time. I think it's only as popular as it is because it comes with Anaconda and PyCharm doesn't, so it's sort of the path of least resistance.

Jupyter notebooks are cool, but I don't really use them, so I can't comment on their effectiveness as an IDE. Creating a new notebook felt clunky and weird when I was playing around with it. However, I do know that PyCharm lets you open Jupyter notebooks, so it seems like you should be able to have the best of both worlds if a notebook suits your needs.
|
# ¿ Feb 6, 2017 09:47 |
|
Rosalind posted:
My questions now are: 1. How the heck does anyone ever learn how to use this? I'm the first to admit that I'm no programming guru, but I have some experience with several different languages and actually getting started with Python makes absolutely no sense. Everything feels like the biggest clusterfuck of dependencies and "install X, Y, and Z to get W to work but to get X to work install A and B which require C to run on Windows." 2. Is there a simple, idiot-proof (because that's what I am apparently) guide to moving from R to Python for data analysis and hopefully eventually machine learning? 

1a. Learning Python is like learning any other language, including R. Use it with whatever project you currently need to do, work through issues, ask questions on the internet, and ask about best practices as you become more comfortable with the syntax.

1b. Rolling your own Python environment in Windows is kind of finicky, which is why people suggested you download and use Anaconda. Anaconda is basically a stand-alone Python distribution that comes with many other useful packages. This means that you don't have to figure out the dependencies yourself; Anaconda has done that for you, and if you need something else then you can probably add it with conda. If you have Anaconda and another Python installation, then you could be accidentally updating one and not the other, which would be confusing and frustrating; I suspect that's what you've been experiencing. I'd suggest removing everything, installing just Anaconda, and then installing whatever additional packages you need with either conda (in case Anaconda already knows of the package) or pip (in the rare cases where it doesn't, you can probably still get the package with pip, and you still won't need to worry about dependencies yourself; pip and conda both do this for you).

2. Tigren's link is good, and Googling around can also get you more resources. Here's a cheat sheet for converting R commands to Python commands (or Matlab commands, in case you have Stockholm syndrome or are forced to use MathWorks products): http://mathesaurus.sourceforge.net/matlab-python-xref.pdf

Here's a cheat sheet for just Machine Learning algorithms: https://www.analyticsvidhya.com/blog/2015/09/full-cheatsheet-machine-learning-algorithms/

I looked up rpy2, and it sounds like that's a package that lets you use R objects in Python? That sounds neat. It looks like rpy2 is recognized by conda, so you can just install it with the command shown on this page: https://anaconda.org/r/rpy2
|
# ¿ Feb 11, 2017 04:40 |
|
Boris Galerkin posted:
I wanna use virtualenv like all the cool kids but one of the python packages/libraries I'm using isn't managed by pip. It's something I need to compile myself with cmake and is linked against the systemwide Intel MKL and other dependencies. It doesn't show up when I type "pip list" but is importable/usable. 

Your virtualenv has its own pip, distinct from the system pip; if you use your virtualenv's pip to install the package that you've compiled, then the package will only be installed to that virtualenv and nowhere else. How you install the package depends on how it's set up, e.g. whether it has a setup.py or a requirements.txt. But basically you can use pip to install local packages as well as remote packages.
|
# ¿ Feb 19, 2017 13:21 |
|
pangstrom posted:
Just as something that bothers me continuously (much like factors in R), is there a nice under-the-hood or mathematical reason why when subsetting lists or numpy arrays the last number is excluded? (e.g., mylist[1:3] means the 2nd and 3rd element, but not the 4th) I'm fine with zero-indexing for some reason but that one fucks me up on the regular. 

I think it was just an arbitrary design choice, but it does make for some nice symmetries for things like half-open indices. Suppose you are splitting a string at 2 different points:
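Something like this (string and split points chosen arbitrarily):

```python
s = "abcdefgh"
i, j = 3, 5

# With half-open slices the three pieces share their boundary indices,
# so they reassemble exactly: no overlap, no gaps, no off-by-one fixups
parts = [s[:i], s[i:j], s[j:]]
print(parts)           # ['abc', 'de', 'fgh']
print("".join(parts))  # 'abcdefgh'
```

If the upper bound were inclusive, you'd have to write s[:i], s[i+1:j], s[j+1:] and keep track of all those +1s.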
It also means that the length of a slice is equal to the difference of its indices; x[3:5] has a length of 2
|
# ¿ Feb 20, 2017 02:54 |
|
pangstrom posted:
I guess that's as good a reason as I could have hoped for. Did/have other languages make/made that design choice? 

I know that Go does it the Python way, but Fortran, Matlab, and Julia do it the other way, where the upper bound is inclusive. There are advantages and disadvantages to both approaches, much like 1-based vs 0-based indexing.
|
# ¿ Feb 20, 2017 03:26 |
|
I'm with Nippashish; ConfigParser feels clunky. If you just want to define a bunch of parameters that are used across a project, placing them in a config.py and importing them as-needed is elegant and simple. Wrapping them in a dict-like implementation and then importing that is good, too. ConfigParser is designed for projects whose users are familiar with Windows ini files, are afraid of opening .py files, and need to be able to modify configurable parameters, with built-in error checking. That's a pretty narrow scope. If your users fall outside of that scope, or if you don't really have many or any users, then ConfigParser adds nothing on top of feeling lovely to use. The error-checking is nice if you're worried about users really loving things up and being unable to recover.
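For comparison, here's roughly what the ConfigParser route looks like (section and option names invented for illustration):

```python
import configparser

ini_text = """
[processing]
threads = 4
verbose = yes
"""

# ConfigParser parses ini-style text and does typed lookups for you,
# raising an error on malformed values instead of silently misbehaving
cp = configparser.ConfigParser()
cp.read_string(ini_text)

threads = cp.getint("processing", "threads")
verbose = cp.getboolean("processing", "verbose")
print(threads, verbose)  # 4 True
```

Compare that with config.py, where the same thing is just `THREADS = 4` and an import.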
|
# ¿ Feb 25, 2017 21:37 |
|
huhu posted:
Could you guys give me an example of what your settings.py files might look like? Also, does the dot notation you use autofill in something like Pycharm? I'd like to be able to type my_config.a and have all the settings pop up. Perhaps I'm going about this wrong. 

Here's maybe the most straightforward example:
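A minimal sketch of the idea (module name, setting names, and values are all invented for illustration):

```python
# A config.py would just hold plain constants at module level:
#
#     DATA_DIR = "/data/project"
#     NUM_WORKERS = 4
#     DEBUG = False
#
# Consumers do `import config` and then `config.NUM_WORKERS`;
# PyCharm autocompletes module attributes like these.

# A dict-like alternative with the same dot access, shown inline here
# so it runs as one file:
from types import SimpleNamespace

my_config = SimpleNamespace(
    data_dir="/data/project",
    num_workers=4,
    debug=False,
)

print(my_config.num_workers)
```

The plain-module version is the simplest and autocompletes best; the namespace version is handy when you want to build or modify the settings at runtime.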
|
# ¿ Feb 26, 2017 02:02 |
|
Nippashish posted:
Any reason you can't use ffmpeg or avconv? 

Yeah, ffmpeg should do a flawless job and is made for doing this sort of thing. Invoke it with Popen if you want. I don't know what your inherited cv2 implementation looks like, Zero Gravitas, but opencv uses ffmpeg libraries for creating movies, and the cv2 movie-writing API is pretty straightforward. It shouldn't be more than 20ish lines. Maybe just rewrite it the correct way? Is there something funky about your inputs, like they sometimes come in with different sizes or something?
|
# ¿ Mar 5, 2017 05:37 |
|
For some reason Enthought (their product Canopy is basically like Anaconda) only supports Python2. I've asked their sales people in the past why they haven't moved to Python3, and their answer was basically "we think scientists still prefer and use 2". I don't think that it's political any longer; now it's just personal preference / legacy vs new code.
|
# ¿ Mar 7, 2017 19:40 |
|
|
|
The point about Python2 coming packaged with many linux distros is also a good one. It's a little surprising, but there you have it. Someone starts off learning with their system python and eventually when they start looking at package managers like Anaconda the question of "do I use Python2 or Python3" has already been answered for them
|
# ¿ Mar 7, 2017 23:41 |