  • Locked thread
QuarkJets
Sep 8, 2008

HDF basically obsoletes the CSV format. If you're using tons and tons of CSVs, consider switching to HDF5. The benefits are innumerable; your files will be way smaller, way faster to read and write, and way more organized (an HDF5 file is like a little file system for your data that you can organize however you want, and compression is transparent and effortless). The downside is that you can't just poo poo out a ton of numbers into a text file

QuarkJets
Sep 8, 2008

Nippashish posted:

And that you need a library to read them. Don't underestimate the value of just being able to poo poo out data and be confident that even a braindead monkey can read it into whatever snowflake system they want.

Yeah, and that's why CSV is going to stick around forever. On the bright side I haven't used a language that doesn't have a really well-developed HDF5 library. For Python, h5py is available from all of the major package managers and lets you treat everything like either a dict or a numpy array. And since it's all C under the hood, you're not parsing a bunch of ASCII characters out of a text file, and you only load into memory the arrays (or parts of arrays) that you asked for, it's fast as gently caress
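Here's a rough sketch of what the dict-like and numpy-like access looks like with h5py. The group name, dataset name, and attribute are made up for illustration, and it assumes h5py and numpy are installed:

```python
import os
import tempfile

import h5py
import numpy as np

# Write to a scratch directory so the example cleans up easily.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Writing: groups behave like nested dicts, datasets like numpy arrays.
with h5py.File(path, "w") as f:
    run = f.create_group("run_001")  # hypothetical group name
    run.create_dataset(
        "temperature",  # hypothetical dataset name
        data=np.arange(100000, dtype="float64"),
        compression="gzip",  # compressed and decompressed transparently
    )
    run.attrs["instrument"] = "hypothetical_sensor"

# Reading: only the requested slice is pulled into memory.
with h5py.File(path, "r") as f:
    dset = f["run_001/temperature"]
    shape = dset.shape
    first_ten = dset[:10]
    instrument = f["run_001"].attrs["instrument"]
```

Slicing the dataset object reads just those ten values from disk; the rest of the array never gets loaded.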

Thermopyle posted:

I wonder how well passing around a sqlite database file would work. Because, man, SQL is awesome. I don't really know how well the performance would work out if you need to slurp it all into memory or something...

I've tried this, the performance was not great on either end so I went back to using HDF5

(I use a MySQL database for all sorts of stuff but HDF5 works better for larger arrays; and in a few weird cases I have used dark magicks ODBC to access data in an HDF5 file using an SQL query!)

QuarkJets fucked around with this message at 14:15 on Jan 25, 2017

QuarkJets
Sep 8, 2008

I hate getting assigned to a project and discovering that they wrote their own half-baked daemon and gave it a special service account just to run a process that has to execute every 10 minutes

QuarkJets
Sep 8, 2008

Eela6 posted:

Thanks! I picked up those tricks from Luciano Ramalho's Fluent Python and Brett Slatkin's Effective Python, respectively.

I am writing a speech for my local python developers group.

It's called 'The Elements of Style.' It's still in the works, but my rough draft is something like 80% complete. I would love advice on the slides. Take a look and let me know what you think!

http://tinyurl.com/sdpystyle

"Functions should have as few arguments as possible" gets introduced on slide 41 but it probably belongs in the list of Function rules on slides 21-26. This is basically just another function rule, is it not?

I think "needless identifiers" is a misnomer; _ is just a variable name like anything else. I understand that some people like to use _ to indicate "this variable is not used" but that's still an identifier. I'm not sure what to actually call it

QuarkJets
Sep 8, 2008

Nippashish posted:

I propose we rename 'zip' to 'as_tuples_by_index' because 'zip' is unnecessarily cryptic, and fails to be self-explanatory unless you already know what it's supposed to mean.

I think 'zip' is actually pretty simple to understand if you've ever seen or used a zipper before

QuarkJets
Sep 8, 2008

Nippashish posted:

zip is intuitive if you know what it means, but if you'd never encountered it before you might not know what it did just from the name. My argument is that _ is the same.

But only one of those references something in the real world. The other is an arbitrarily-decided convention

I'm not saying that _ is a bad convention, but your analogy is really weak

QuarkJets
Sep 8, 2008

It's also worth mentioning that _ is commonly used as an alias for the gettext function, so using it as a throwaway also commits the sin of shadowing a function with a variable (although _ is a terrible function name to begin with). Still, it's easy to get into trouble if you're working on code that uses this badly-named alias

Pylint regards any local variable with an underscore in front of it (such as _unused, _foo, _bar) as unused, which seems like it's the most Pythonic solution

Python code:
child, _parent, pie_plate = return_pie_state(person)
By convention, this code says that the parent is unused, but if someone came along later and wanted to use the parent then they easily could without having to check the API. That is both advantageous and more in accordance with "explicit is better than implicit" (even if it's not totally in accordance, merely moreso than using _)

Note that _ isn't mentioned in PEP8 anywhere, I think it's the kind of convention that just kind of spread by word of mouth and is generally safe to ignore for those who want to ignore it

QuarkJets
Sep 8, 2008

Thermopyle posted:

That's because it's a programming convention, not a Python convention. It's used in probably dozens of languages.


Because there's more cognitive overhead in reading a word vs recognizing an underscore. But mainly because everyone knows what a _ means, and then you come across some guy wanting to use "unused" and then you don't know wtf is going on. Did this guy really not know that everyone uses "_"? Maybe he did, so now maybe he means something different by "unused"?

Okay you're exaggerating though, right? If you saw someone assign something to a variable named "unused" surely you wouldn't clutch your head in confusion, instead you'd probably think "oh this variable must not be used by anything" regardless of your experience level

QuarkJets
Sep 8, 2008

baka kaba posted:

Maybe it's a boolean that contains information about whether something is used? Or a collection of unused things from a pool? Like, is the value you're assigning part of an object's state, or something you're going to be ignoring? You can make an assumption, but how sure are you without further investigation?

That's the nice thing about conventions, it gets everyone on the same page so you can be pretty drat sure what's happening and what you can forget immediately. Thermo's making the point that when someone ignores a convention, you have to start asking why and you can't really rely on your assumptions anymore

Neither of those arguments is working for me, as they both also apply against "_" and for "unused" respectively

The "_" character is even less descriptive than "unused", and so it could store literally anything. Case in point, _ is already aliased to the gettext function, and in the interpreter it maps to the value returned by the previous command. Even if the convention is to store unneeded data in "_", that doesn't necessarily make it a good idea; using a different convention might be better

I've seen "unused" used to hold unused returns many times, so apparently that is a convention as well, although possibly it's just more commonly used in scientific computing than in other fields. But still, it's conventional, so your point about the positives of convention applies to it as well.

I'd like to re-raise the idea of always labeling returned variables descriptively like you would have done anyway but prefixing those names with "_" when they're unused. Pylint already supports this, it's more explicit, it's more readable, and it still carries the benefits of conventions. It seems like this is the best of all possible worlds, the only downside being that you have to type a few more characters, which seems pretty negligible.

QuarkJets
Sep 8, 2008

Eela6 posted:

The problem is that prefixing variables with a single underscore already means something by python convention. It signifies a "protected" variable that's internal to the module and not meant to be used by a public consumer of the function.

That's actually consistent though; just as you shouldn't use a variable marked as unused, you also shouldn't use a variable marked as "protected". So that's not a problem.

You also shouldn't want to import an unused variable, so that's all the more reason to label unused variables with leading underscores! I'm sure that "_" is already captured in this rule, and switching to using "_" as a prefix character wouldn't change the behavior, which is good.

So... no downsides so far except "it's not already the convention"
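The import rule is easy to check. This sketch builds a throwaway module in memory (the module name "pie" and its contents are made up) and shows that a star-import skips both a bare "_" and any name with a leading underscore when the module defines no __all__:

```python
import sys
import types

# Build a throwaway module in memory; "pie" and its contents are hypothetical.
pie = types.ModuleType("pie")
exec(
    "child, _parent, plate = 'kid', 'mom', 'empty'\n"
    "_ = 'ignored'\n",
    pie.__dict__,
)
sys.modules["pie"] = pie

# Star-import into a fresh namespace and see which names came across.
namespace = {}
exec("from pie import *", namespace)
imported = sorted(name for name in namespace if not name.startswith("__"))
```

Only child and plate end up in the namespace, so the two spellings behave the same way as far as imports are concerned.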

QuarkJets
Sep 8, 2008

But explicit is better than implicit, right? Doesn't _ go against that?

QuarkJets
Sep 8, 2008

baka kaba posted:

Flat is better than nested :v:

Honestly a lot of those rules seem better served by _. Sparseness, readability, one way to do things (in the sense you don't need to think of a context-appropriate name to convey 'unused', that's what _ is actually for).

I guess you can argue it's less explicit than a name, but that goes for a lot of sugar and basic language features. It's a tool to pick up like anything else, once you know it it's about as explicit as you can get

That's not what explicit means, there is no way that _ could be considered more explicit than a descriptive variable name. And again any argument you make about conventions (which you're now calling "a tool") applies equally well to using _ as a prefix (which is already a convention/tool in pylint).

I'm not sure why you're against using _ as a prefix for marking unused variables, it's already an in-use convention, it's in accordance with the other ways that underscores are usually used, and it's friendlier to people on the outside who may have never seen the convention before (such as C programmers).

QuarkJets fucked around with this message at 13:19 on Feb 3, 2017

QuarkJets
Sep 8, 2008

Nippashish posted:

It's almost like clarity of different conventions depends on which conventions you're used to seeing and using yourself.

Nah, there's definitely a universal objective answer.

This. Even just in this thread we've seen several posters who had either not seen the "_ as unused" convention, had seen a similar but different convention, or who knew of the convention but don't use it because they think that a different convention is better. These all seem like okay states to be in and I don't see anything wrong with someone using an alternative if they think that's better, especially when that alternative is already recognized in pylint

We are talking about something that is usually inconsequential (variable name) that contains something that is definitely inconsequential (unused return value) so maybe it's just more important that a codebase is internally consistent than that it aligns with an unstated convention ported over from other languages?

QuarkJets
Sep 8, 2008

Thermopyle posted:

I think this is the first time this argument has been brought up against the "_" in this discussion which seems odd if it's the best argument or right-est argument.

I also don't think it's an accurate characterization of the ubiquity of '_', nor of the accuracy of pylint. But pinning down both of those things is the only way to go forward with this conversation, and I certainly don't think that's an easy thing to quantify. I don't guess there's anywhere else to go, because I certainly am not interested in spending the time quantifying the ubiquity of '_', nor whether pylint's rules say anything about that ubiquity.

Regardless, I don't think anyone would argue against staying consistent with an already existing codebase...using a style different from the existing codebase is even worse than not using '_'.

:ughh:

Pylint recognizes other conventions as being okay, why won't you?

QuarkJets
Sep 8, 2008

Thermopyle posted:

I'm not sure you understand what you highlighted in my quote means?

I mean...using your own style in a code base is bad and worse than using a less popular convention for an unused variable. I'm not sure what's controversial about that.

In the text that I highlighted you're subtly implying that using _ for unused variables is objectively better than other conventions. It's the "vim vs emacs" of variable naming; both options are fine and if you're starting a new codebase then it's fine to use whatever you want

QuarkJets
Sep 8, 2008

Fusion Restaurant posted:

What IDEs do people like for doing data stuff in Python? I've been just using Spyder or Jupyter notebooks pretty much because they were the default options w/ Anaconda.

Has anyone found that a different IDE has some particular benefits?

Data scientist checking in

PyCharm is not really made with data analysis in mind but it is still a really great IDE. It's what I use when it's available and when I'm not being lazy/opening Vim for a quick thing.

Spyder is solid but debugging in it is a real chore compared to PyCharm. I think it also doesn't have native version control and that its code refactoring tools are per-file rather than per-project, but I could be wrong; I haven't used it in a long time. I think it's only as popular as it is because it comes with Anaconda and PyCharm doesn't, so it's sort of the path of least resistance.

Jupyter notebooks are cool but I don't really use them, so I can't comment on their effectiveness as an IDE. Creating a new notebook felt clunky and weird when I was playing around with it. However, I do know that PyCharm lets you open Jupyter notebooks, so it seems like you should be able to have the best of both worlds if a notebook suits your needs

QuarkJets
Sep 8, 2008

Rosalind posted:

My questions now are: 1. How the heck does anyone ever learn how to use this? I'm the first to admit that I'm no programming guru, but I have some experience with several different languages and actually getting started with Python makes absolutely no sense. Everything feels like the biggest clusterfuck of dependencies and "install X, Y, and Z to get W to work but to get X to work install A and B which require C to run on Windows." 2. Is there a simple, idiot-proof (because that's what I am apparently) guide to moving from R to Python for data analysis and hopefully eventually machine learning?

1a. Learning Python is like learning any other language, including R. Use it with whatever project you need to currently do, work through issues, ask questions on the internet, ask about best practices as you become more comfortable with the syntax

1b. Rolling your own Python environment in Windows is kind of finicky, which is why people suggested you download and use Anaconda. Anaconda is basically a stand-alone Python distribution that comes with many other useful packages. This means that you don't have to figure out the dependencies yourself; Anaconda has done that for you, and if you need something else then you can probably add it with conda. If you have Anaconda and another Python installation then you could be accidentally updating one and not the other, which would be confusing and frustrating, and I suspect that's what you've been experiencing. I'd suggest removing everything, installing just Anaconda, and then installing whatever additional packages you need with either conda (in case Anaconda already knows of the package) or pip (in the rare cases where it doesn't, you can probably still get the package with pip, and you still won't need to worry about dependencies yourself; pip and conda both do this for you).

2. Tigren's link is good, Googling around can also get you more resources. Here's a cheat sheet for converting R commands to Python commands (or Matlab commands in case you have Stockholm Syndrome or are forced to use MathWorks products)
http://mathesaurus.sourceforge.net/matlab-python-xref.pdf

Here's a cheat sheet for just Machine Learning algorithms:
https://www.analyticsvidhya.com/blog/2015/09/full-cheatsheet-machine-learning-algorithms/

I looked up rpy2, and it sounds like that's a package that lets you use R objects in Python? That sounds neat. It looks like rpy2 is recognized by conda, so you can just install it with the command shown at this page:
https://anaconda.org/r/rpy2
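On point 1b, if you suspect that two Python installations are fighting each other, a quick check like this shows which interpreter is actually running your code. It's only a diagnostic sketch; the paths it prints depend entirely on your machine:

```python
import sys

# The interpreter running this code; with Anaconda in use, this path
# should point somewhere inside the Anaconda directory.
interpreter = sys.executable
# The root of the active environment, where packages get installed.
environment_root = sys.prefix
# The version of that interpreter, e.g. "3.6.0".
version = sys.version.split()[0]

print(interpreter)
print(environment_root)
print(version)
```

Running pip as "python -m pip install somepackage" (somepackage is a placeholder) ties the install to that same interpreter, which avoids updating one installation while importing from another.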

QuarkJets
Sep 8, 2008

Boris Galerkin posted:

I wanna use virtualenv like all the cool kids but one of the python packages/libraries I'm using isn't managed by pip. It's something I need to compile myself with cmake and is linked against the systemwide Intel MKL and other dependencies. It doesn't show up when I type "pip list" but is importable/usable.

How can I use virtualenv --no-site-packages to get a clean environment, but with this one package included/available? Can I just symlink something to my virtualenv folder (and what)?

Your virtualenv has its own pip, distinct from the system pip; if you use your virtualenv's pip to install the package that you've compiled, then the package will only be installed to that virtualenv and nowhere else.

Installing the package depends on how it's set up (whether it has a setup.py, a requirements.txt, etc.). But basically you can use pip to install local packages as well as remote packages

QuarkJets
Sep 8, 2008

pangstrom posted:

Just as something that bothers me continuously (much like factors in R), is there a nice under-the-hood or mathematical reason why when subsetting lists or numpy arrays the last number is excluded? (e.g., mylist[1:3] means the 2nd and 3rd element, but not the 4th) I'm fine with zero-indexing for some reason but that one fucks me up on the regular.

I think it was just an arbitrary design choice, but half-open ranges (lower bound included, upper bound excluded) do make for some nice symmetries. Suppose you are splitting a string at 2 different points:
Python code:
x = 'this is a string'
i = 3
j = 7
a = x[:i]
b = x[i:j]
c = x[j:]
If the upper bound were inclusive, then you'd have to add or subtract 1 somewhere in the indexing in the example above to avoid repeating or dropping a character.

It also means that the length of a slice is equal to the difference of its indices; x[3:5] has a length of 2
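Both properties can be checked directly. This is just the example above with assertions added:

```python
x = 'this is a string'
i, j = 3, 7

a, b, c = x[:i], x[i:j], x[j:]

# Adjacent half-open slices share an index and tile the string exactly,
# with no character repeated or dropped.
assert a + b + c == x

# The length of a slice is the difference of its bounds.
assert len(x[i:j]) == j - i
assert len(x[3:5]) == 2

# An empty slice falls out naturally when both bounds are equal.
assert x[i:i] == ''
```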

QuarkJets
Sep 8, 2008

pangstrom posted:

I guess that's as good a reason as I could have hoped for. Did/have other languages make/made that design choice?

I know that Go does it the Python way, but Fortran, Matlab, and Julia do it the other way, where the upper bound is inclusive.

There are advantages and disadvantages to both approaches, much like 1-based vs 0-based indexing

QuarkJets
Sep 8, 2008

I'm with Nippashish; ConfigParser feels clunky. If you just want to define a bunch of parameters that are used across a project, placing them in a config.py and importing them as-needed is elegant and simple. Wrapping them in a dict-like implementation and then importing that is good, too.

ConfigParser is designed for projects that want users who are familiar with Windows ini files and who are afraid of opening .py files to be able to modify configurable parameters, with built-in error checking. That's a pretty narrow scope. If your users fall out of that scope, or if you don't really have many or any users, then ConfigParser adds nothing on top of feeling lovely to use. The error-checking is nice if you're worried about users really loving things up and being unable to recover
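For comparison, here's a small sketch of the error checking that ConfigParser buys you. The section and option names are made up, and the module is spelled configparser in Python 3 (ConfigParser in Python 2):

```python
import configparser

# Hypothetical ini contents that a user might edit by hand.
ini_text = """
[counting]
do_count_butts = yes
max_butts = lots
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# getboolean accepts yes/no, on/off, true/false and 1/0.
do_count_butts = parser.getboolean("counting", "do_count_butts")

# A value that can't be converted raises ValueError when it is read,
# so a bad edit can be caught and replaced with a default.
try:
    max_butts = parser.getint("counting", "max_butts")
except ValueError:
    max_butts = 100  # hypothetical default
```

With a plain config.py, a user who writes something invalid gets a syntax error or a wrong type at import time instead, which is fine if your users are comfortable reading a traceback.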

QuarkJets
Sep 8, 2008

huhu posted:

Could you guys give me an example of what your settings.py files might look like? Also, does the dot notation you use autofill in something like Pycharm? I'd like to be able to type my_config.a and have all the settings with a pop up. Perhaps I'm going about this wrong.

Edit: I'm probing because the project I'm trying to work on might be useful for others. Probably not but I'd like the practice of making my stuff more user friendly.

Here's maybe the most straightforward example

Python code:
#settings.py
do_count_butts = True
Python code:
# some other file in the project
from settings import do_count_butts

#...

if do_count_butts:
    print('it seems that we are counting butts now')
Pycharm will definitely autocomplete if you do it this way. It will also autocomplete if you just add properties to a class:

Python code:
#settings.py
class Params:
    do_count_butts = True
Python code:
# some other file
from settings import Params
if Params.do_count_butts:
    pass  # etc
The world is your oyster

QuarkJets
Sep 8, 2008

Nippashish posted:

Any reason you can't use ffmpeg or avconv?

Yeah ffmpeg should do a flawless job and is made for doing this sort of thing. Invoke it with Popen if you want.

I don't know what your inherited cv2 implementation looks like, Zero Gravitas, but opencv uses ffmpeg libraries for creating movies and the cv2 movie-writing API is pretty straightforward. It shouldn't be more than 20ish lines. Maybe just rewrite it the correct way? Is there something funky about your inputs, like they sometimes come in with different sizes or something?
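If you go the Popen route, it could look something like this. It's a sketch that assumes the inputs are numbered image files being stitched into a movie; the file names are hypothetical, and it assumes ffmpeg is on the PATH and was built with libx264:

```python
import subprocess


def build_ffmpeg_command(frame_pattern, output_path, fps=30):
    """Return an ffmpeg argument list that encodes numbered frames to video."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-framerate", str(fps),  # rate at which the input frames are read
        "-i", frame_pattern,     # printf-style pattern, e.g. frame_%04d.png
        "-c:v", "libx264",       # H.264 encoder (assumes ffmpeg was built with it)
        "-pix_fmt", "yuv420p",   # pixel format that most players can decode
        output_path,
    ]


def encode_frames(frame_pattern, output_path, fps=30):
    """Run ffmpeg with Popen and return its exit code (0 means success)."""
    process = subprocess.Popen(build_ffmpeg_command(frame_pattern, output_path, fps))
    return process.wait()


# Hypothetical file names; nothing is executed until encode_frames is called.
command = build_ffmpeg_command("frames/frame_%04d.png", "movie.mp4", fps=24)
```

Passing the arguments as a list avoids shell quoting problems, and checking the return value of encode_frames tells you whether ffmpeg succeeded.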

QuarkJets
Sep 8, 2008

For some reason Enthought (their product Canopy is basically like Anaconda) only supports Python2. I've asked their sales people in the past why they haven't moved to Python3 and their answer was basically "we think scientists still prefer and use 2"

I don't think that it's political any longer, now it's just personal preference / legacy vs new code

QuarkJets
Sep 8, 2008

The point about Python2 coming packaged with many linux distros is also a good one. It's a little surprising, but there you have it. Someone starts off learning with their system python and eventually when they start looking at package managers like Anaconda the question of "do I use Python2 or Python3" has already been answered for them
