|
Personally, even though I use miniconda I don't put its main bin folder on my path. I symlink activate and conda into my path, and then never use the root environment. So I don't get conda stuff on my path until I explicitly activate an environment.
|
# ¿ Mar 7, 2017 02:30 |
|
QuarkJets posted: What if you're deploying code to multiple users on a common system?

I usually write a shell wrapper that modifies PATH and LD_LIBRARY_PATH inline to use Anaconda, if necessary (such as when my application links against the specific version of pythonxx.so that ships with Anaconda). I've only had to face compilation issues with Anaconda a couple of times, but I solved them by keeping everything inside Anaconda: install all the dependencies as Anaconda packages, then make a conda recipe for the actual compilation. Then you can distribute the precompiled conda package.
|
# ¿ Mar 7, 2017 14:30 |
|
Nice! Here's mine: Python code:
|
# ¿ Mar 10, 2017 02:39 |
|
Ah, I see. I tried to fix it by replacing cls with type(self) in __init__ and __repr__, but you can't get it to work with an overridden __init__ without using __new__. Nice solution. Edit: although, as long as proper subclass hygiene is used and the new init starts with a super().__init__ call, it should work. SurgicalOntologist fucked around with this message at 04:49 on Mar 10, 2017 |
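In case it helps, here is a minimal sketch of the type(self) idea; the classes are hypothetical stand-ins, not the easyrepr code from the thread:

```python
# A made-up example showing why type(self) in __repr__ plays nicely with
# subclasses, provided each subclass __init__ calls super().__init__.
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # type(self) rather than a hardcoded name, so subclasses
        # report their own class automatically
        return f'{type(self).__name__}(x={self.x!r}, y={self.y!r})'


class Point3D(Point):
    def __init__(self, x, y, z):
        super().__init__(x, y)  # the "proper subclass hygiene" in question
        self.z = z
```

With this, repr(Point3D(1, 2, 3)) picks up the subclass name automatically even though the __repr__ lives on the parent.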
# ¿ Mar 10, 2017 04:46 |
|
You can also avoid the issue in several ways. My favored solution would be to turn the stochastic process into a generator then use it like this: Python code:
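A hedged sketch of what the generator version might look like; the random walk here is a made-up stand-in for the actual stochastic process:

```python
import random
from itertools import islice

# The process yields successive states forever; the caller decides how
# many to consume, which keeps the process logic and its use decoupled.
def random_walk(x=0.0, step=1.0):
    while True:
        yield x
        x += random.choice([-step, step])

# take just the first five states
first_five = list(islice(random_walk(), 5))
```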
|
# ¿ Mar 15, 2017 15:19 |
|
Not sure if I should put this here or in the Scientific Computing thread, but whatever. I have a data analysis project that involves computing a series of variables on a dataset (specifically, I'm adding DataArrays to an xarray Dataset, but it's the same idea as adding columns to a DataFrame). Each calculation requires some variables to already be available and adds one or more variables to the dataset. Some of these are expensive to calculate and may not be necessary in all circumstances, so I don't want to run everything. In short, I want a way to handle this dependency graph.

This seems like it should be a common pattern, but I don't know of an existing easy solution. The closest thing I can think of is dask's DAG compute engine, but this doesn't really fit that model since I'm enriching the dataset object rather than passing around values. If there's no existing solution, then the best option I can think of is to have something like a require call at the top of every function that checks whether the required elements are present and runs the corresponding function if they are not. Any better ideas?
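For concreteness, here is a hypothetical sketch of that "require" idea, with a plain dict standing in for the xarray Dataset; the registry, decorator, and variable names are all invented for illustration:

```python
# Each computation registers what it provides and requires; require()
# recursively runs whatever is missing before returning.
computations = {}  # variable name -> (function, prerequisite names)

def provides(*names, requires=()):
    def decorator(func):
        for name in names:
            computations[name] = (func, requires)
        return func
    return decorator

def require(dataset, *names):
    for name in names:
        if name in dataset:
            continue  # already computed, skip the expensive call
        func, prereqs = computations[name]
        require(dataset, *prereqs)  # satisfy dependencies first
        func(dataset)               # the function adds its variables in place

@provides('speed', requires=('x', 'y'))
def compute_speed(dataset):
    dataset['speed'] = (dataset['x'] ** 2 + dataset['y'] ** 2) ** 0.5

data = {'x': 3.0, 'y': 4.0}
require(data, 'speed')
```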
|
# ¿ Mar 16, 2017 22:59 |
|
Ah yes, that's the way it should work. But it's complicated slightly by the fact that these aren't attributes but rather DataArrays within a Dataset (analogous to columns in a DataFrame). And xarray does the same thing pandas does to allow these variables to be set with __setitem__ but retrieved with __getitem__. I could subclass Dataset and make my lazy variables true properties... but I'd probably have to override the name-collision detection error. Or I could make my own get function: Python code:
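A sketch of what that "own get function" might look like, with a plain dict standing in for the Dataset; the getter registry and variable names are made up:

```python
# Registry of lazy getters: each computes its variable and stores it on
# the dataset, so it only ever runs once per variable.
getters = {
    'double_x': lambda ds: ds.__setitem__('double_x', 2 * ds['x']),
}

def get_var(dataset, name):
    # lazy: run the registered getter only if the variable is missing
    if name not in dataset:
        getters[name](dataset)
    return dataset[name]

ds = {'x': 5}
value = get_var(ds, 'double_x')
```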
|
# ¿ Mar 17, 2017 00:12 |
|
Hmm, it seems like this is more complicated. The methods I need to call aren't really getters but assigners, because sometimes coordinates need to be fixed, or other implementation details handled, when assigning a new data array. Also, some of them assign multiple arrays. For example, see these two methods: Python code:
Not sure that coming up with a general solution will be worth it. Maybe I could make the lazy properties work with composition instead of subclassing...
|
# ¿ Mar 17, 2017 00:28 |
|
Nope, you shouldn't need to figure out the html variable. response.text in your second block is the same as html.read() in your first. In other words, scrapy gets the html for you and passes it in as response. That is, if I'm understanding the scrapy API correctly based on what you posted.
|
# ¿ Mar 23, 2017 21:12 |
|
Yeah, you probably want to (a) decide if you want to write a new CSV for each page that is scraped, or just append to a single file; and (b) call writerow more than once (i.e. put it inside a loop).
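A minimal sketch of option (b), a single CSV appended to across pages, with writerow inside a loop; the function and file names are illustrative, not from the original scraper:

```python
import csv

# Append each scraped page's rows to one CSV file, one writerow call
# per row rather than one per page.
def save_rows(rows, path='scraped.csv'):
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)  # inside the loop, once per row
```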
|
# ¿ Mar 23, 2017 22:54 |
|
KingNastidon posted: Just getting started with pandas, so this is probably a really naive question. I'm trying to replicate a really basic algorithm from VBA for oncology forecasting. There is a data file with new patient starts and anticipated duration of therapy in monthly cohorts. Each new patient cohort loses a certain number of patients each time period, using an exponential function to mimic the shape of a Kaplan-Meier progression curve. The sum of a column represents the total remaining patients on therapy across all cohorts. Below is the code I'm using:

I don't fully understand the algorithm, but the first step is to try to do this with vector operations. This might not work as written and may need small tweaks, but at least it should illustrate the idea of vectorizing the operation. Python code:
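A hedged sketch of what the vectorized version might look like; the decay rate and cohort numbers are invented, and this models the exponential-decay description rather than the poster's exact VBA logic:

```python
import numpy as np

# Each monthly cohort of new starts decays exponentially; the total on
# therapy at month t sums the surviving fraction of every earlier cohort.
def patients_on_therapy(new_starts, decay_rate):
    months = np.arange(len(new_starts))
    # age[t, c] = how many months cohort c has been on therapy at month t
    age = months[:, None] - months[None, :]
    survival = np.where(age >= 0, np.exp(-decay_rate * age), 0.0)
    return survival @ new_starts  # one matrix product instead of nested loops

totals = patients_on_therapy(np.array([100.0, 50.0, 0.0]), decay_rate=0.5)
```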
|
# ¿ Mar 31, 2017 12:16 |
|
Edit: nevermind I see what you're doing. Carry on.
SurgicalOntologist fucked around with this message at 00:48 on Apr 8, 2017 |
# ¿ Apr 8, 2017 00:45 |
|
Cingulate posted: What's the point of the '0's here? Is this best behavior somehow?

You're right, seeing this again I would do it without the 0s.
|
# ¿ Apr 11, 2017 14:54 |
|
Cingulate posted: Different question: does Continuum make money? What are the chances "conda install all_the_software_i_need" won't work in 2018 because Travis Oliphant has to choose between making it slightly easier for me to set up numpy or feeding his kids?

The nice thing about a heavily used open-source project is that it cannot really go away. If Continuum disappears, the code will still exist, and there will be enough volunteers to maintain it.
|
# ¿ Apr 18, 2017 14:50 |
|
can you do something like this: Python code:
|
# ¿ Apr 18, 2017 15:20 |
|
That's reasonable too. In that case I would just do Python code:
Boris Galerkin posted: e: I want the method injected to be bound to the object so I can't just assign it as an attribute.

which I don't follow. With the above, it is bound to the object.
|
# ¿ Apr 18, 2017 18:06 |
|
Boris Galerkin posted: Maybe I got my terminology wrong. I thought bound meant the thing where it implicitly gets "self" passed to it.

Eela6 posted: You don't need to do any special magic to create a method and attach it to an object. So long as the first parameter is self it should 'just work'. Phoneposting so i cant go into more detail, but if you want an example I do so in my easyrepr implementation from earlier in this thread.

That's what I thought would happen (it would get self passed automatically), but I just checked and Boris is right: it doesn't. Huh. This is the first time in a while I've been surprised by Python. I guess you do need to do that thing with the types module (or use the first method I suggested).

Edit: Eela6, IIRC your easyrepr thing attaches to the class rather than an instance. In that case it works as expected.
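A minimal demonstration of "that thing with the types module," binding a function to a single instance so self is passed implicitly; the class and function are hypothetical:

```python
import types

class Widget:
    pass

def greet(self):
    # self is only passed automatically once the function is bound
    return f'hello from {type(self).__name__}'

w = Widget()
w.greet = types.MethodType(greet, w)  # bound to this instance only
```

Plain `w.greet = greet` would store an ordinary function, so calling `w.greet()` would not pass `w` as self; MethodType is what creates the binding.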
|
# ¿ Apr 18, 2017 19:07 |
|
Yes it will. Python code:
|
# ¿ Apr 18, 2017 22:13 |
|
laxbro posted: Newbie question: I'm trying to build a web scraper with Beautiful Soup that pulls table rows off of the roster pages for a variety of college sports teams. I've got it working with regular HTML pages, but it doesn't seem to work with what appear to be Angular pages. Some quick googling makes it seem like I will need to use a python library like selenium to virtually load the page in order to scrape the html tables on the page. Would a work around be to first use Beautiful Soup, but if a table row returns as None, then call a function to try scraping the page using something like selenium. Or, should I just try to scrape all of the pages with selenium or a similar library?

Load the page with Chrome developer tools open. The Network tab, I think, is where you can inspect requests. You might be able to figure out how the page is fetching the data. I've been in a similar situation and found that the site was making a request to some JSON API before loading the data. In developer tools you can see what the request headers and response look like. In the easiest case you can just hit the API URL and get JSON back. In a more complicated case you can examine the request headers and mimic them in your scraper.
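A hypothetical sketch of that approach: once the Network tab reveals the JSON endpoint the page calls behind the scenes, request it directly with browser-like headers. The URL and header values here are placeholders, not a real roster API:

```python
import json
import urllib.request

# Hit the JSON endpoint directly instead of rendering the page with
# selenium; mimic whatever headers the browser sent if the API is picky.
def fetch_json(url, user_agent='Mozilla/5.0'):
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode('utf-8'))
```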
|
# ¿ Apr 23, 2017 15:19 |
|
x is the element (in your first code chunk); to get its data-symbol attribute, do x.attrs['data-symbol']. Using attrs in find_all would be for the use case of finding an element that has a specific data-symbol, or something along those lines to get all the elements that have that attribute at all. But if you're already able to get the element itself another way, it's straightforward to grab any attribute.
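A small sketch of both directions on made-up HTML: reading data-symbol off an element you already have, versus using attrs in find_all to locate elements by that attribute:

```python
from bs4 import BeautifulSoup

html = '<span data-symbol="AAPL">Apple</span><span>no symbol</span>'
soup = BeautifulSoup(html, 'html.parser')

x = soup.find('span')
symbol = x.attrs['data-symbol']  # attribute of an element you already have

# True matches any value, so this finds every element carrying the attribute
with_symbol = soup.find_all('span', attrs={'data-symbol': True})
```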
|
# ¿ May 6, 2017 07:08 |
|
Looks like it expects redis to be running. Install redis and run it.
|
# ¿ Jun 3, 2017 18:59 |
|
Baby Babbeh posted: That... makes sense. I was overcomplicating this. It returns another df rather than changing it in place, right?

As Cingulate mentioned, that would be an in-place function call. But I thought I would chime in to show how to do it as a function returning a new df: Python code:
Anyways, DataFrame.where (or more commonly for me Series.where) is one of the most common operations when I'm using method chaining.
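A small illustration of that point, with invented data: where() returns a new DataFrame and leaves the original untouched, which is what makes it chainable:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, -2, 3], 'b': [-4, 5, -6]})

# keep values where the condition holds, replace the rest with 0;
# df itself is unchanged
clipped = df.where(df > 0, 0)
```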
|
# ¿ Jun 9, 2017 17:39 |
|
Malcolm XML posted: Perfectly fine if your algorithms are insensitive to things like that (i use floats in quant finance optimization models) but try simulating a chaotic system and watch as floats are completely useless

Umm, I simulate chaotic systems... what should I be using if not floats?
|
# ¿ Jun 11, 2017 22:23 |
|
Malcolm XML posted: a complicated algorithm that compensates for the fact that the system is very sensitive to accumulated approximation error and a lot of whiskey when b/c even then long term behavior is sketchy

But yeah, dealing with chaotic systems, good luck if you need to make specific predictions. Investigating the qualitative behavior of the system is more useful anyway. In any case, you started by saying "floats are completely useless" for simulating chaotic systems, which I still don't understand: should I not be using floats? Or are you just making the point that chaos magnifies errors? In which case, I have errors of much higher orders of magnitude I'm already concerned about.
|
# ¿ Jun 12, 2017 02:49 |
|
Specifically, you should be okay using scipy.integrate.odeint and not even thinking about the kind of solver it uses under the hood. You just need to formulate your problem as a function that takes the state as an input vector and outputs a vector of the rates of change. It could easily be 1000x faster or more than doing the Euler method by hand. Edit: and inside said function you should use a dot product or matrix multiplication (via numpy) for all that multiplying and adding. SurgicalOntologist fucked around with this message at 19:15 on Jun 17, 2017 |
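A hedged sketch of that setup, with a made-up damped oscillator standing in for the real system: state vector in, rates of change out, the linear part done as a single matrix product:

```python
import numpy as np
from scipy.integrate import odeint

A = np.array([[0.0, 1.0],
              [-1.0, -0.1]])  # hypothetical system matrix

def rates(state, t):
    # numpy does all the multiplying and adding in one matrix product
    return A @ state

times = np.linspace(0, 10, 101)
trajectory = odeint(rates, [1.0, 0.0], times)  # one row of state per time
```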
# ¿ Jun 17, 2017 19:12 |
|
The function you posted, which you determined by profiling is slowing your code down, is an implementation of the Euler method (the whole 1-second steps thing) for solving differential equations. So maybe you're also doing other things, but you absolutely are solving differential equations. And by the way, your bigger problem is interesting to me. I'm writing my dissertation on estimating parameters of differential equations from observed timeseries (if this makes sense: it's a hierarchical method in which the diff eq parameters are regression DVs. Allowing one to test hypotheses like "the experimental manipulation will significantly affect parameter x"). Anyways, it sounds like what you're doing would fit into my framework. It's not ready, so I'm not going to suggest you use it, but your problem sounds like something I could include in my intro or conclusion when I list the applications in various domains. When you get a chance, could you share in general terms the problem you're trying to solve, and the approach you're taking? I'll probably have follow-up questions so maybe a PM would be more appropriate so we don't clog up the thread.
|
# ¿ Jun 17, 2017 21:46 |
|
So just to clarify, you want to find the parameters that best reproduce the timeseries in the data? If so, this is a real tricky problem (and yes, you are solving differential equations along the way... these are the simulations). I'd be surprised if you have any luck with the local optimization methods in scipy, and genetic algorithms are a rabbit hole I also went down, with lots of time spent for little reward. The state of the art is, AFAIK, this method. My dissertation is basically expanding it to the case where you have multiple data series and want to characterize how their parameters vary. Which seems to be exactly what you want to do. Unfortunately I don't have an open-source package ready yet for the general case.

On the other hand... I jumped straight to assuming you have a nonlinear, chaotic system, and I didn't look closely enough at the equations to see whether that's the case or not. If you have a linear dynamical system you probably don't even need to solve it numerically, and parameter estimation should be straightforward. But if you do have a nonlinear system, and don't feel like implementing Abarbanel's method or tracking down his MATLAB implementation, you are left with something like what you are doing, what I call a "black box" parameter estimation method: choose a parameter set to test, run the simulations, compute the error. In which case we're right back to where we started: regardless of what kind of optimization you end up doing over it, your immediate problem is speeding up the simulations. And the answer here is what was suggested before: formulate the system as a function which takes a state vector (i.e. a numpy array) and parameters as input and outputs the rates of change of the state vector, and send it to scipy.integrate.odeint. If you can't figure it out from the scipy docs you should be able to find more examples online, or post back for help.
Once that's fast, I would start by testing parameter sets more or less manually (in your case, IIUC, you are testing "super-parameter" sets, in that they are parameters of the equations that determine the actual diffeq parameters. Or you are testing the actual equations rather than just their parameters. Either way, the same recommendations apply). Run some examples by hand to get a sense of how the errors behave. Do some brute-force testing (i.e., test every single combination over a range of the parameters; if you have a bunch it may have to be an incredibly coarse range, but it's still worth it) and plot the errors in various ways. Spend some time doing this kind of thing before you start looking into optimizers, because that's a whole can of worms, and optimizers often don't respond well to "black box" type problems (where you can't tell them the Jacobian of the error function). "Getting a feel for it" will get you pretty far and help you to understand what the optimizers are doing. And again, if you do have a linear system, or use a method like Abarbanel's to (sort of) linearize your system, then you can determine the Jacobian and a basic optimizer will do the job. Good luck! You've picked a problem that is much harder than it first appears. SurgicalOntologist fucked around with this message at 17:46 on Jun 19, 2017 |
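A hedged sketch of the brute-force step: evaluate the fit error on every combination of a coarse parameter grid. The simulate() function is a made-up exponential-decay stand-in for the poster's model, not their actual equations:

```python
import itertools
import numpy as np

def simulate(a, b):
    # stand-in "simulation": a two-parameter decaying curve
    return a * np.exp(-b * np.linspace(0, 1, 50))

observed = simulate(2.0, 0.5)  # pretend data with known parameters

def error(params):
    return float(np.sum((simulate(*params) - observed) ** 2))

# coarse grids; with many parameters these would have to be even coarser
grid_a = np.linspace(1.0, 3.0, 5)
grid_b = np.linspace(0.1, 1.0, 10)
errors = {params: error(params)
          for params in itertools.product(grid_a, grid_b)}
best = min(errors, key=errors.get)  # then plot errors to "get a feel for it"
```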
# ¿ Jun 19, 2017 17:44 |
|
code:
|
# ¿ Aug 2, 2017 05:49 |
|
I think with pandas you could do Python code:
|
# ¿ Aug 3, 2017 04:42 |
|
If I were you, I would get it working however you can, then work on improving/extending it. You shouldn't have to use iterrows or iteritems very often if at all. If you are replacing lookups, you can think of indexing into a pandas object (e.g. prices[items]) as a vectorized lookup--so you don't need to do them one at a time. If there's a specific thing you can't figure out how to do without iterating, post it (or a simplified example or whatever), those are fun challenges and I think transforming code from one style to another is a good way to learn.
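A small sketch of the "vectorized lookup" point, on invented data: indexing a Series with a list of keys replaces a loop of one-at-a-time lookups:

```python
import pandas as pd

prices = pd.Series({'apple': 1.5, 'bread': 2.0, 'milk': 1.2})
items = ['milk', 'apple', 'milk']

# one vectorized operation: preserves order and allows repeats
looked_up = prices[items]
```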
|
# ¿ Sep 4, 2017 19:51 |
|
Slimchandi posted: For example of what I mean by excel versus python issue, there are calculations of customer losses in October that depend on the total customer numbers in September. Only then can you get total customers in October, which can be used to find losses in Nov etc. This kind of iterative approach makes sense to me in Excel, but in pandas I am used to calculating each series at a time, rather than two interdependently.

For this kind of thing the trick is to "factor out" the interdependent aspects of the calculation. For example, if you can first predict "percent customers lost", e.g. based on seasonality factors, then from there it's straightforward to build a series of actual customers lost or total customers or whatever you'd like. It will probably also be useful to index your data not by the actual month but by "months into the future", e.g. [0, 1, 2...]. You can always change the labels when you are visualizing/displaying the data.
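A hedged illustration of that factoring-out, with invented numbers: start from a loss-rate series per months-ahead, then roll totals forward with cumprod instead of a month-by-month loop:

```python
import pandas as pd

# predicted fraction of customers lost each month (e.g. from seasonality)
loss_rate = pd.Series([0.0, 0.05, 0.04, 0.06],
                      index=pd.RangeIndex(4, name='months_ahead'))
starting = 1000.0

# survival compounds multiplicatively, so cumprod replaces the iteration
total = starting * (1 - loss_rate).cumprod()          # customers remaining
losses = total.shift(1).fillna(starting) * loss_rate  # customers lost per month
```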
|
# ¿ Sep 4, 2017 21:13 |
|
conda has a --file argument that is equivalent to pip's -r. I tend to handle it by having one file conda-requirements.txt and another pip-requirements.txt; I've never had an issue with that setup. I would not attempt to copy over the actual environment, but rather the instructions to create it, e.g. the requirements files and maybe a bash script (it can even start from scratch and include curling the miniconda install script). That will be much easier to maintain, easier to transfer, and less prone to weird errors. Just remember to keep it up to date.

Regarding the MultiIndex question, the solution would be to just have the same value for all rows of that type. "both"? Spans aren't really a thing in pandas, because spans are more about displaying data than organizing/manipulating it. If you do want to display the data using spreadsheet concepts, you may be able to convert to Excel and then continue to manipulate it from Python using one of the Excel libraries; openpyxl is one, if I recall correctly. SurgicalOntologist fucked around with this message at 15:37 on Sep 15, 2017 |
# ¿ Sep 15, 2017 15:32 |
|
Boris Galerkin posted: I don't have internet access on some of these machines.

My bad, I missed that. In that case I would say just copy over the entire miniconda folder. Everything you need is in there.
|
# ¿ Sep 15, 2017 22:09 |
|
Daviclond posted: Why does this one-liner:

It's just because list.reverse is implemented as an in-place operation, changing an existing list (and returning nothing) rather than returning a new one. I would also prefer the latter, but many of the manipulation methods on the built-in data structures are in-place operations.
|
# ¿ Sep 15, 2017 22:11 |
|
Boris Galerkin posted: I tried copying my entire miniconda folder instead except this doesn't work either because the path miniconda is installed to is hardcoded or something, so it can't be changed.

I don't recall this being the case; the only exception I know of is the change to the PATH environment variable that the installer makes. Even if I'm wrong on that, can't you just install it to where you want it to end up on the first machine? Or symlink? Of the roadblocks you've encountered, this seems like the easiest to solve.
|
# ¿ Sep 18, 2017 15:17 |
|
Matthew Rocklin has some great talks demonstrating the use of toolz. Check vimeo and YouTube.
|
# ¿ Oct 27, 2017 16:14 |
|
I'm banging my head against the wall on a numpy/pandas float issue. I have a DataFrame of simulation results (or observations, doesn't matter). There is a floating-point column "time" that would be convenient as an index, but because floating-point indexes are difficult I am using timestep (integer) as the index and keeping time as a regular column. However, a common operation is looking up the index at a particular time. Because time is a float it can't be done the easy way (this is the same reason not to use time as the index). Anyways, here is my attempt: Python code:
Here are my tests: Python code:
Unfortunately, it fails consistently at time step 9: code:
In short, I can't figure out a good way to reliably get the closest index with floats. I really don't want to do something like argmin(abs(time_series - time_values)) for a frequently used lookup function, but is that the only way? SurgicalOntologist fucked around with this message at 17:47 on Nov 16, 2017 |
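One possible fix, sketched under the assumption that the samples are (nearly) evenly spaced: recover the integer timestep by rounding time / dt, which is robust to 0.9-style representation error. The dt value is made up:

```python
import numpy as np
import pandas as pd

dt = 0.1
time = pd.Series(np.arange(100) * dt)  # index is the integer timestep

def timestep_at(t, dt=dt):
    # round() absorbs the tiny float error in both t and dt
    return int(round(t / dt))

step = timestep_at(0.9)  # works even though 0.9 isn't exactly representable
```

If the samples aren't evenly spaced, I believe pandas can also do a nearest-neighbor lookup on an index via Index.get_indexer([...], method='nearest'), which avoids the full argmin(abs(...)) by hand.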
# ¿ Nov 16, 2017 17:44 |
|
A lot of people learn through books and whatnot but I learned through the official tutorial docs and I think it was great for me. If you already understand programming then it's probably a fine option for you as well. I would recommend coming up with a not-too-complicated project that you can do as a learning goal, and then whenever you get far enough into the tutorial that you think you can accomplish it, stop reading and try the project.
|
# ¿ Dec 7, 2017 21:12 |
|
Sockser posted: When building out some code last week, I accidentally used {} to make an array instead of ()

It's called a set. It acts a bit like dictionary keys, in that there are no duplicates, only hashable values are allowed, and order is undefined (although is this changing for sets too?), but there are no values, not even None. It's useful for keeping track of things where you don't want multiple copies of anything, for getting the unique elements of another collection, or of course for set operations like union and intersection. Edit: it's faster for membership testing (the "in" keyword), so I often use set literals in expressions like if extension in {'csv', 'tsv', 'xls', 'xlsx'}: (even though a speed consideration is pointless in this example, the semantics of using a set here work better as well).
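The behaviors described, gathered in one small example:

```python
# duplicates collapse, membership tests are fast, set algebra is built in
seen = {'csv', 'tsv', 'csv'}                 # duplicate literal collapses
unique_letters = set('mississippi')          # unique elements of a collection
tabular = {'csv', 'tsv'} | {'xls', 'xlsx'}   # union

extension = 'tsv'
is_tabular = extension in tabular            # fast membership test
```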
|
# ¿ Dec 20, 2017 20:12 |
|
That's a good use case for pandas. Something like Python code:
|
# ¿ Jan 8, 2018 18:12 |