SurgicalOntologist
Jun 17, 2004

Personally, even though I use miniconda I don't put its main bin folder on my path. I symlink activate and conda into my path, and then never use the root environment. So I don't get conda stuff on my path until I explicitly activate an environment.

SurgicalOntologist
Jun 17, 2004

QuarkJets posted:

What if you're deploying code to multiple users on a common system? I usually write a shell wrapper that inline modifies the path and ld_library_path to use anaconda, if it's necessary (such as if my application links against the specific version of pythonxx.so that comes with anaconda).

Things can also become very complicated if your project is using something like cmake to specify your Python area

I've only had to face compilation issues with Anaconda a couple of times, but I solved them by keeping everything inside Anaconda: I installed all the dependencies as Anaconda packages and then made a conda recipe for the actual compilation. Then you can distribute the precompiled conda package.

SurgicalOntologist
Jun 17, 2004

Nice! Here's mine:

Python code:
from inspect import signature


def call_repr(name, *args, **kwargs):
    args_repr = ', '.join(repr(arg) for arg in args)
    kwargs_repr = ', '.join('{}={!r}'.format(*kwarg) for kwarg in kwargs.items())
    joiner = ', ' if args_repr and kwargs_repr else ''
    return '{}({}{}{})'.format(name, args_repr, joiner, kwargs_repr)



def repr_as_initialized(cls):
    # Keep a reference to the original __init__ so the wrapper can still call it.
    cls._orig__init__ = cls.__init__

    def __init__(self, *args, **kwargs):
        # Record the arguments the instance was constructed with.
        self._bound_args = signature(cls).bind(*args, **kwargs)
        self._orig__init__(*args, **kwargs)

    def __repr__(self):
        return call_repr(cls.__name__, *self._bound_args.args, **self._bound_args.kwargs)

    cls.__init__ = __init__
    cls.__repr__ = __repr__
    return cls


# tests
@repr_as_initialized
class ReprTester:
    def __init__(self, a, b, c=1):
        pass

assert ReprTester.__name__ == 'ReprTester'
assert repr(ReprTester(1, 2)) == 'ReprTester(1, 2)'
assert repr(ReprTester(1, 2, 3)) == 'ReprTester(1, 2, 3)'
assert repr(ReprTester(1, 2, c=3)) == 'ReprTester(1, 2, c=3)'

SurgicalOntologist
Jun 17, 2004

Ah, I see. I tried to fix it by replacing cls with type(self) in __init__ and __repr__, but you can't get it to work with an overridden __init__ without using __new__. Nice solution.

Edit: although, as long as proper subclass hygiene is used and the new init starts with a super().__init__ call, it should work.

SurgicalOntologist fucked around with this message at 04:49 on Mar 10, 2017

SurgicalOntologist
Jun 17, 2004

You can also avoid the issue in several ways.

My favored solution would be to turn the stochastic process into a generator then use it like this:
Python code:
value = next(v for v in stochastic_process_generator() if meets_conditions(v))
Or you could use break:
Python code:
while True:
    value = stochastic_process()
    if meets_conditions(value):
        break
Otherwise, I would use None as was suggested and avoid the problem you bring up by prefacing the conditions with value is not None.
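For illustration, here's a minimal, self-contained sketch of the generator approach (stochastic_process_generator and meets_conditions are hypothetical stand-ins for your actual process and conditions):
Python code:
import random


def stochastic_process_generator():
    # Hypothetical process: yield one draw per iteration, forever.
    while True:
        yield random.gauss(0, 1)


def meets_conditions(value):
    # Stand-in condition.
    return value > 2


value = next(v for v in stochastic_process_generator() if meets_conditions(v))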

SurgicalOntologist
Jun 17, 2004

Not sure if I should put this here or in the Scientific Computing thread, but whatever.

I have a data analysis project that involves computing a series of variables on a dataset (specifically, I'm adding DataArrays to an xarray DataSet, but it's the same idea as if I were adding columns to a DataFrame). Each calculation requires some variables to already be available and adds one or more variables to the dataset. Some of these are expensive to calculate and may not be necessary to run in all circumstances so I don't want to run everything. In short, I want a way to handle this dependency graph.

This seems like it should be a common pattern but I don't know of an existing easy solution. Closest thing I can think of is dask's DAG compute engine but this doesn't really fit into that model since I'm enriching the dataset object rather than passing around values.

If there's no existing solution, the best option I can think of is to have something like a require method call at the top of every function that checks if the required elements are present and then runs the corresponding function if they are not. Any better ideas?
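To make the require idea concrete, here's a rough sketch (the class and method names are hypothetical, and it assumes one compute_* method per derived variable):
Python code:
class Analysis:
    def __init__(self, dataset):
        self.dataset = dataset  # e.g. an xarray Dataset

    def require(self, *var_names):
        # Compute any missing variables before the caller uses them.
        for name in var_names:
            if name not in self.dataset:
                getattr(self, 'compute_' + name)()

    def compute_velocity(self):
        self.require('position', 'time')
        self.dataset['velocity'] = ...  # the actual calculation goes here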

SurgicalOntologist
Jun 17, 2004

Ah yes, that's the way it should work. But it is complicated slightly by the fact that these aren't attributes but rather DataArrays within a DataSet (analogous to columns in a DataFrame). And xarray does the same thing pandas does to allow these variables to be set with __setitem__ but retrieved with __getitem__.

I could subclass DataSet and make my lazy variables true properties...but I'd probably have to override the name collision detection error.

Or I could make my own get function:
Python code:
    def get(self, var_name):
        if var_name not in self._dataset:
            getter_name = '_' + var_name
            if not hasattr(self, getter_name):
                raise KeyError(var_name)
            self._dataset[var_name] = getattr(self, getter_name)()
        return self._dataset[var_name]
I just realized I might have other issues stemming from the fact that the getter methods have some **kwarg parameters and I may want to maintain multiple copies of each... but this may not be a common enough thing to plan for.

SurgicalOntologist
Jun 17, 2004

Hmm, seems like this is more complicated. The methods I need to be calling aren't really getters but assigners, because sometimes coordinates need to be fixed, or other implementation details handled, when assigning a new data array. Also, some of them assign multiple arrays.

For example see these two methods:
Python code:
    def assign_du_dt(self, name, u, t, fix_coords=None, **kwargs):
        """Fixes time coordinate issue with ``ds.assign(name=du_dt(u, t, **kwargs))``.

        Parameters
        ----------
        name : str
        u : str or xr.DataArray
        t : str or xr.DataArray
        fix_coords : list of str
            Default is ``['clock', 'seconds', 'half']``.
        **kwargs

        Returns
        -------
        xr.Dataset

        """
        if isinstance(u, str):
            u = self._ds[u]
        if isinstance(t, str):
            t = self._ds[t]
        if fix_coords is None:
            fix_coords = ['clock', 'seconds', 'half']

        return (
            self._ds
                .assign(**{name: u.s.d_dt(t, **kwargs)})
                .assign_coords(**{coord: self._ds[coord] for coord in fix_coords if coord in self._ds})
        )

    def assign_polarized(self, u, name=None, angle_name=None, dist_name=None):
        """Shortcut for ``ds[name + '_angle'], ds[name + '_dist'] = u.s.polarize()``

        Parameters
        ----------
        u : str or xr.DataArray
        name : str, optional
            Default value is ``u.name``.
        angle_name, dist_name : str, optional
            Default values are ``name + '_angle'`` and ``name + '_dist'``.

        Returns
        -------
        xr.Dataset

        """
        if isinstance(u, str):
            u = self._ds[u]
        if name is None:
            name = u.name
        if angle_name is None:
            angle_name = name + '_angle'
        if dist_name is None:
            dist_name = name + '_dist'

        self._ds[angle_name], self._ds[dist_name] = u.s.polarize()
        return self._ds
My lazy evaluation system would need to know to call assign_du_dt('velocity', 'position', 'time') if velocity is missing and assign_polarized('velocity', angle_name='heading', dist_name='speed') if heading or speed were missing....

Not sure that coming up with a general solution will be worth it. Maybe I could make the lazy properties work with composition instead of subclassing...

SurgicalOntologist
Jun 17, 2004

Nope, you shouldn't need to figure out the html variable. response.text in your second block is the same as html.read() in your first. In other words, scrapy gets the html for you and passes it in as response.

That is, if I'm understanding the scrapy API correctly based on what you posted.

SurgicalOntologist
Jun 17, 2004

Yeah, you probably want to (a) decide if you want to write a new CSV for each page that is scraped, or just append to a single file; and (b) call writerow more than once (i.e. put it inside a loop).
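For what it's worth, a minimal sketch of the append-to-one-file option (rows stands in for whatever your spider extracted from the current page):
Python code:
import csv

with open('output.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)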

SurgicalOntologist
Jun 17, 2004

KingNastidon posted:

Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain amount of patients each time period using an exponential function to mimic the shape of a kaplan-meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using:

code:
for x in range(0, nummonths):
    for y in range(x, nummonths):
        cohorts.iloc[x,y] = data.ix[x,'nps'] * math.pow(data.ix[x,'duration'], y-x)
Calculating this n by n array with n>500 is instantaneous in VBA, but is taking 10+ seconds in pandas with jupyter. Any thoughts on what makes this so slow in pandas or better way to go about this?

I don't fully understand the algorithm, but the first step is to try to do this with vector operations. This might not work as written but may only need small tweaks. At least it should illustrate the idea of vectorizing the operation.

Python code:
x, y = np.meshgrid(range(0, nummonths), range(0, nummonths))
cohorts = data['nps'] * data['duration']**(y - x)

SurgicalOntologist
Jun 17, 2004

Edit: nevermind I see what you're doing. Carry on.

SurgicalOntologist fucked around with this message at 00:48 on Apr 8, 2017

SurgicalOntologist
Jun 17, 2004

Cingulate posted:

What's the point of the '0's here? Is this best behavior somehow?

You're right, seeing this again I would do it without the 0s.

SurgicalOntologist
Jun 17, 2004

Cingulate posted:

Different question: does Continuum make money? What are the chances "conda install all_the_software_i_need" won't work in 2018 because Travis Oliphant has to choose between making it slightly easier for me to set up numpy or feeding his kids?

The nice thing about a heavily used open source project is that it cannot really go away. If Continuum disappears the code will still exist and there will be enough volunteers to maintain it.

SurgicalOntologist
Jun 17, 2004

can you do something like this:


Python code:
class NavierStokes:
    def __init__(self, ..., nu_method='constant'):
        ...
        self.nu = getattr(self, '_nu_' + nu_method)

    def _nu_constant(self, ...):
        ...

    def _nu_full(self, ...):
        ...
although I still don't understand the logic behind not wanting to assign it as an attribute, because otherwise you could just pass it to __init__.

SurgicalOntologist
Jun 17, 2004

That's reasonable too. In that case I would just do

Python code:
def __init__(self, ..., nu_func=default_nu_func):
    self.nu = nu_func
The only reason I didn't suggest that is this:

Boris Galerkin posted:

e: I want the method injected to be bound to the object so I can't just assign it as an attribute.

which I don't follow. With the above it is bound to the object.

SurgicalOntologist
Jun 17, 2004

Boris Galerkin posted:

Maybe I got my terminology wrong. I thought bound meant the thing where it implicitly gets "self" passed to it.

Eela6 posted:

You don't need to do any special magic to create a method and attach it to an object. So long as the first parameter is self it should 'just work'. Phoneposting so i cant go into more detail, but if you want an example I do so in my easyrepr implementation from earlier in this thread.


That's what I thought would happen (it would get self passed automatically), but I just checked and Boris is right, it doesn't. Huh. This is the first time in a while I've been surprised by Python. I guess you do need to do that thing with the types module (or use the first method I suggested).

Edit: Eela6, IIRC about your easyrepr thing, it attaches to the class rather than an instance. In that case it works as expected.
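For reference, here's what "that thing with the types module" looks like (the names here are made up for illustration):
Python code:
import types


class Foo:
    pass


def nu(self):
    return self.x


obj = Foo()
obj.x = 42

# Assigning the plain function makes an ordinary instance attribute;
# calling obj.nu_plain() would NOT pass self automatically.
obj.nu_plain = nu

# Binding it explicitly with types.MethodType makes it behave like a method.
obj.nu = types.MethodType(nu, obj)
print(obj.nu())  # 42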

SurgicalOntologist
Jun 17, 2004

Yes it will.
Python code:
In [1]: class Test:
      :     pass
      : 
      : 

In [2]: def foo(self):
      :     print('bar')
      :     

In [3]: a = Test()

In [4]: Test.f = foo

In [5]: a.f()
bar

In [6]: b = Test()

In [7]: b.f()
bar
This is the kind of thing I've done or seen done before; it didn't cross my mind that it wouldn't work with instances.

SurgicalOntologist
Jun 17, 2004

laxbro posted:

Newbie question: I'm trying to build a web scraper with Beautiful Soup that pulls table rows off of the roster pages for a variety of college sports teams. I've got it working with regular HTML pages, but it doesn't seem to work with what appear to be Angular pages. Some quick googling makes it seem like I will need to use a python library like selenium to virtually load the page in order to scrape the html tables on the page. Would a work around be to first use Beautiful Soup, but if a table row returns as None, then call a function to try scraping the page using something like selenium. Or, should I just try to scrape all of the pages with selenium or a similar library?

Load the page with Chrome developer tools open. The Network tab, I think, is where you can inspect requests. You might be able to figure out how the page is fetching the data. I've been in a similar situation and found that the site was making a request to some JSON API before loading the data. In developer tools you can see what the request headers and response look like. In the easiest case you can just hit the API url and get JSON back. In a more complicated case you can examine the request headers and mimic them in your scraper.
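If it does turn out to be a simple JSON endpoint, the scraper side can be as small as this (the URL and headers are placeholders for whatever you find in the Network tab):
Python code:
import requests

url = 'https://example.com/api/roster?team=123'  # hypothetical endpoint
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com/roster',  # mimic what the browser sent, if needed
}
response = requests.get(url, headers=headers)
data = response.json()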

SurgicalOntologist
Jun 17, 2004

x is the element (in your first code chunk); to get its data-symbol attribute, do x.attrs['data-symbol']. Using attrs in find_all would be for the use case of finding an element that had a specific data-symbol. Or, you could do something along those lines to get all the elements that have that attribute at all. But if you are already able to get the element itself another way, it's straightforward to grab any attribute.
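A quick sketch of both cases (assuming html holds the page source; the tag name is just an example):
Python code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Grab the attribute from an element you already have:
x = soup.find('span', attrs={'data-symbol': True})
symbol = x.attrs['data-symbol']

# Or collect it from every element that has the attribute at all:
symbols = [el.attrs['data-symbol'] for el in soup.find_all(attrs={'data-symbol': True})]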

SurgicalOntologist
Jun 17, 2004

Looks like it expects redis to be running. Install redis and run it.

SurgicalOntologist
Jun 17, 2004

Baby Babbeh posted:

That... makes sense. I was overcomplicating this. It returns another df rather than changing it in place, right?

As Cingulate mentioned, that would be an in-place function call. But I thought I would chime in to show how to do it as a function returning a new df:

Python code:
new_df = df.where(df == 0, 1)
The semantics are kind of reversed: think of it like "keep the original where the condition is true, otherwise change the value", so the above is the equivalent of Cingulate's example df[df != 0] = 1.

Anyways, DataFrame.where (or more commonly for me Series.where) is one of the most common operations when I'm using method chaining.

SurgicalOntologist
Jun 17, 2004

Malcolm XML posted:

Perfectly fine if your algorithms are insensitive to things like that (i use floats in quant finance optimization models) but try simulating a chaotic system and watch as floats are completely useless

It gets better when you have no idea how chaotic your system is!

Umm I simulate chaotic systems... what should I be using if not floats?

SurgicalOntologist
Jun 17, 2004

Malcolm XML posted:

a complicated algorithm that compensates for the fact that the system is very sensitive to accumulated approximation error and a lot of whiskey when b/c even then long term behavior is sketchy
Do you have an actual algorithm in mind? Because if so I'd be curious to read up on it. I always figured that after solver issues (i.e. discretization error, stiffness, etc.) and measurement error on initial conditions, floating-point arithmetic is pretty low on the list of concerns.

But yeah, dealing with chaotic systems, good luck if you need to make specific predictions. Investigating the qualitative behavior of the system is more useful anyway.

In any case, you started by saying "floats are completely useless" for simulating complex systems, which I still don't understand--should I not be using floats? Or are you just making the point that chaos magnifies errors? In which case, I have errors of much higher orders of magnitude I'm already concerned about.

SurgicalOntologist
Jun 17, 2004

Specifically, you should be okay using scipy.integrate.odeint and not even thinking about the kind of solver it uses under the hood. You just need to formulate your problem in terms of a function that takes the state as an input vector and outputs a vector of the rates of change. It could easily be 1000x faster or more than doing the Euler method by hand.

Edit: and inside said function you should use dot product or matrix multiplication (via numpy) for all that multiplying and adding.
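As a shape-of-the-code sketch (the Lorenz system here is just a stand-in for whatever your equations are):
Python code:
import numpy as np
from scipy.integrate import odeint


def lorenz(state, t, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Takes the state vector and time, returns the rates of change.
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])


t = np.linspace(0, 50, 5000)
trajectory = odeint(lorenz, [1.0, 1.0, 1.0], t)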

SurgicalOntologist fucked around with this message at 19:15 on Jun 17, 2017

SurgicalOntologist
Jun 17, 2004

The function you posted, which you determined by profiling is slowing your code down, is an implementation of the Euler method (the whole 1-second steps thing) for solving differential equations. So maybe you're also doing other things, but you absolutely are solving differential equations.

And by the way, your bigger problem is interesting to me. I'm writing my dissertation on estimating parameters of differential equations from observed timeseries (if this makes sense: it's a hierarchical method in which the diff eq parameters are regression DVs, allowing one to test hypotheses like "the experimental manipulation will significantly affect parameter x"). Anyways, it sounds like what you're doing would fit into my framework. It's not ready, so I'm not going to suggest you use it, but your problem sounds like something I could include in my intro or conclusion when I list the applications in various domains. When you get a chance, could you share in general terms the problem you're trying to solve, and the approach you're taking? I'll probably have follow-up questions so maybe a PM would be more appropriate so we don't clog up the thread.

SurgicalOntologist
Jun 17, 2004

So just to clarify, you want to find the parameters that best reproduce the timeseries in the data?

If so, this is a real tricky problem (and yes, you are solving differential equations along the way... these are the simulations). I'd be surprised if you have any luck with the local optimization methods in scipy, and genetic algorithms are a rabbit hole I also went down, with lots of time spent for little reward. The state of the art is AFAIK this method. My dissertation is basically expanding it to the case where you have multiple data series and want to characterize how their parameters vary, which seems to be exactly what you want to do. Unfortunately I don't have an open-source package ready yet for the general case.

On the other hand... I jumped straight to assuming you have a nonlinear, chaotic system, and I didn't look closely enough at the equations to see if that's the case or not. If you have a linear dynamical system you probably don't even need to solve it numerically and parameter estimation should be straightforward.

But, if you do have a nonlinear system, and don't feel like implementing Abarbanel's method or tracking down his MATLAB implementation, you are left with something like what you are doing, what I call a "black box" parameter estimation method. Choose a parameter set to test, run the simulations, compute the error. In which case we're right back to where we started: regardless of what kind of optimization you end up doing over it, your immediate problem is speeding up the simulations. And the answer here is what was suggested before: formulate the system as a function which takes a state vector (i.e. numpy array) and parameters as input and outputs the rates of change of the state vector, and send it to scipy.integrate.odeint. If you can't figure it out from the scipy docs you should be able to find more examples online, or post back for help.

Once that's fast I would start by testing parameter sets (in your case, IIUC, you are testing "super-parameter" sets in that they are parameters of the equations that determine the actual diffeq parameters. Or you are testing the actual equations rather than just their parameters. Either way, the same recommendations) more or less manually. Run some examples by hand to get a sense of how the errors behave. Do some brute force testing (i.e., test every single permutation over a range of the parameters--if you have a bunch it may have to be an incredibly coarse range, but still worth it) and plot the errors in various ways. Spend some time doing this kind of thing before you start looking into optimizers, because that's a whole can of worms and optimizers often don't respond well to "black box" type problems (where you can't tell it the Jacobian of the error function). "Getting a feel for it" will get you pretty far and help you to understand what the optimizers are doing.
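A bare-bones version of the brute-force sweep might look like this (simulate and error are stand-ins for your simulation and error metric; the toy model and grid are made up):
Python code:
import itertools

import numpy as np


def simulate(params):
    # Stand-in for your actual simulation (e.g. an odeint call).
    t = np.linspace(0, 10, 100)
    return params['a'] * np.exp(-params['b'] * t)


def error(simulated, observed):
    return np.sqrt(np.mean((simulated - observed) ** 2))


observed = simulate({'a': 0.5, 'b': 1.0})  # pretend data

param_grid = {
    'a': np.linspace(0.1, 1.0, 5),
    'b': np.linspace(0.1, 2.0, 5),
}
results = []
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    results.append((params, error(simulate(params), observed)))

best_params, best_error = min(results, key=lambda pair: pair[1])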

And again, if you do have a linear system, or use a method like Abarbanel's to (sort of) linearize your system, then you can determine the Jacobian and a basic optimizer will do the job.

Good luck! You've picked a problem that is much harder than it first appears.

SurgicalOntologist fucked around with this message at 17:46 on Jun 19, 2017

SurgicalOntologist
Jun 17, 2004

code:
df.groupby(df.index.dayofyear)
?

SurgicalOntologist
Jun 17, 2004

I think with pandas you could do

Python code:
word_values = pd.Series(values_dict).sort_index()
for words in word_lists:
    list_min = word_values[words].min()
    ...
Not sure how much more performant it would be, if any.

SurgicalOntologist
Jun 17, 2004

If I were you, I would get it working however you can, then work on improving/extending it. You shouldn't have to use iterrows or iteritems very often if at all. If you are replacing lookups, you can think of indexing into a pandas object (e.g. prices[items]) as a vectorized lookup--so you don't need to do them one at a time.

If there's a specific thing you can't figure out how to do without iterating, post it (or a simplified example or whatever), those are fun challenges and I think transforming code from one style to another is a good way to learn.

SurgicalOntologist
Jun 17, 2004

Slimchandi posted:

For example of what I mean by excel versus python issue, there are calculations of customer losses in October that depend on the total customer numbers in September. Only then can you get total customers in October, which can be used to find losses in Nov etc. This kind of iterative approach makes sense to me in Excel, but in pandas I am used to calculating each series at a time, rather than two interdependently.

For this kind of thing the trick is to "factor out" the inter-dependent aspects of the calculation. For example if you can first predict "percent customers lost", e.g. based on seasonality factors, then from there it's straightforward to build a series of actual customers lost or total customers or whatever you'd like. It will probably also be useful to index your data not by the actual month but by "months into the future", e.g. [0, 1, 2...]. You can always change the labels when you are visualizing/displaying the data.
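As a toy example of what I mean (the numbers and names are made up):
Python code:
import pandas as pd

months_ahead = range(12)
pct_lost = pd.Series(0.05, index=months_ahead)  # e.g. predicted from seasonality factors
starting_customers = 1000

total_customers = starting_customers * (1 - pct_lost).cumprod()
customers_lost = total_customers.shift(1, fill_value=starting_customers) - total_customers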

SurgicalOntologist
Jun 17, 2004

conda has a --file argument that is equivalent to the -r in pip. I tend to handle it by having one file conda-requirements.txt and another pip-requirements.txt. I've never had an issue with that setup.

I would not attempt to copy over the actual environment, rather the instructions to create it, e.g. the requirements files and maybe a bash script (it can even start from scratch and include curling the miniconda install script). It will be much easier to maintain, easier to transfer, and less prone to weird errors. Just remember to keep it up to date.

Regarding the MultiIndex question, the solution would be to just have the same value for all rows of that type. "both"?

Spans aren't really a thing in pandas, because spans are more about displaying data than organizing/manipulating it. If you do want to display the data using spreadsheet concepts, you may be able to convert to excel and then continue to manipulate it from Python using one of the excel libraries. openpyxl might be one, I don't recall.

SurgicalOntologist fucked around with this message at 15:37 on Sep 15, 2017

SurgicalOntologist
Jun 17, 2004

Boris Galerkin posted:

I don't have internet access on some of these machines.

My bad, I missed that.

In that case I would say just copy over the entire miniconda folder. Everything you need is in there.

SurgicalOntologist
Jun 17, 2004

Daviclond posted:

Why does this one-liner:

code:
return my_list.reverse()
return None, but over two lines this:
code:
my_list.reverse()
return my_list
returns the reversed list as intended?

I thought returning code which had to be evaluated (e.g. "return x == y") was fine, so I'm clearly missing something here. Is there a better way to do it?

It's just because list.reverse is implemented as an in-place operation, changing an existing list (and returning nothing) rather than returning a new one. I would also prefer if it was the latter, but for many of the manipulation methods on built-in data structures, they are in-place operations.
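If you want the one-liner, build a new reversed list instead of reversing in place:
Python code:
return my_list[::-1]            # slicing builds a new, reversed copy
# or
return list(reversed(my_list))  # reversed() gives an iterator; list() materializes it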

SurgicalOntologist
Jun 17, 2004

Boris Galerkin posted:

I tried copying my entire miniconda folder instead except this doesn't work either because the path miniconda is installed to is hardcoded or something, so it can't be changed.

I don't recall this being the case; the only exception I know of is the change to the PATH environment variable that the installer makes.

Even if I'm wrong on that, can't you just install it to where you want it to end up on the first machine? Or symlink? Of the roadblocks you encountered, this seems like the easiest to solve.

SurgicalOntologist
Jun 17, 2004

Matthew Rocklin has some great talks demonstrating the use of toolz. Check vimeo and YouTube.

SurgicalOntologist
Jun 17, 2004

I'm banging my head against the wall on a numpy/pandas float issue. I have a DataFrame of simulation results (or observations, doesn't matter). There is a floating point column "time", that would be convenient as an index but because floating-point indexes are difficult I am using timestep (integer) as the index and keeping time as a regular column. However, a common operation is looking up the index at a particular time. Because time is a float it can't be done the easy way (this is the same reason not to use time as the index). Anyways, here is my attempt:
Python code:
def index_at_time(time_series, time_values):
    return time_series.searchsorted(time_values)
Here time_series is a pandas Series but if you're only familiar with numpy this is equivalent to np.searchsorted(time_series, time_values).

Here is my test:
Python code:
import numpy as np
import pandas as pd
from hypothesis import given
from hypothesis.strategies import floats


@given(floats(min_value=0.5, max_value=10))
def test_index_at_time_vector_input(mul):
    time_series = pd.Series(np.arange(0, 2, 0.1))
    time_steps = np.arange(20)
    time_values = time_steps * mul / 10
    time_values /= mul
    np.testing.assert_allclose(index_at_time(time_series, time_values), range(20))
I am using hypothesis to feed in floats in order to mess with the floating-point representation without actually changing the underlying values.

Unfortunately, it fails consistently at time step 9:
code:
mul = 0.5000000000026954

>       np.testing.assert_allclose(index_at_time(time_series, time_values), range(20))
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       (mismatch 10.0%)
E        x: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 10, 10, 11, 12, 13, 14, 15, 16,
E              17, 19, 19], dtype=int64)
E        y: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
E              17, 18, 19])
It's the same thing when I test the values one at a time, I get 0, ..., 7, 8, 10, 10, ... 19.

In short, I can't figure out a good way to reliably get the closest index with floats. I really don't want to do something like argmin(abs(time_series - time_values)) for a frequently used lookup function, but is that the only way?

SurgicalOntologist fucked around with this message at 17:47 on Nov 16, 2017

SurgicalOntologist
Jun 17, 2004

A lot of people learn through books and whatnot but I learned through the official tutorial docs and I think it was great for me. If you already understand programming then it's probably a fine option for you as well. I would recommend coming up with a not-too-complicated project that you can do as a learning goal, and then whenever you get far enough into the tutorial that you think you can accomplish it, stop reading and try the project.

SurgicalOntologist
Jun 17, 2004

Sockser posted:

When building out some code last week, I accidentally used {} to make an array instead of ()
e.g. arr = { 'a', 'b', 'c', }

This was working but the order was getting goofed, which led me to discover my error. Is this just generating a dictionary of only keys with None values?

It's called a set. It kind of acts like dictionary keys in that there are no duplicates, only hashable values are allowed, and order is undefined (although is this changing for sets too?), but there are no values, not even None. It's useful for keeping track of things where you don't want to keep multiple copies of anything, for getting the unique elements of another collection, or of course for set operations like union and intersection.

Edit: it's faster for membership testing (the "in" keyword) so I often use set literals in expressions like if extension in {'csv', 'tsv', 'xls', 'xlsx'}: (even though a speed consideration is pointless in this example, the semantics of using a set here work better as well).
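A quick illustration of the common uses (plain Python, nothing project-specific here):
Python code:
letters = ['a', 'b', 'a', 'c', 'b']
unique = set(letters)               # {'a', 'b', 'c'}, order not guaranteed
overlap = unique & {'a', 'c', 'd'}  # intersection -> {'a', 'c'}
print('b' in unique)                # fast membership test -> True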

SurgicalOntologist
Jun 17, 2004

That's a good use case for pandas. Something like

Python code:
pd.read_csv(filename, index_col=0).loc[keys_to_retrieve]
Edit: to elaborate, in pandas a Series/DataFrame is basically a dictionary that allows vectorized lookups.

  • Reply