DoctorTristan

LuckySevens posted:

Oops, I just copy and pasted his style instead of rewriting, which was silly.


This is true. I'm only trying to build familiarity with how some models are built with python, not intending to ever go out and actually trade.


Also, is there a more pythonic way to do this?

code:
from dataclasses import dataclass, field

@dataclass
class StockOption:
    ...
    is_call: bool = field(default=True)
Since I'm using the European option class, is there a way to do an isinstance check for inheritance when I establish my dataclass init stuff?

What are you trying to achieve with this isinstance that isn’t achieved by just having the one dataclass inherit from another? Is there something you want to do in __init__ that the dataclass isn’t doing?

Imo you’re exposing too much of the implementation here. traverse_tree and stock_tree should have a leading underscore to indicate they’re ‘private’, and calc_price should be the only ‘public’ method. (You might also want to rename it to something like calc_price_using_tree, since there are other ways to calculate an option price and you might want to extend the class to include those.)

Also, the trees used in the binomial option pricing method aren’t especially complicated and can easily be represented with an array and some moderately clever indexing - generic tree data structures aren’t necessary (and could be quite tricky to adapt, since nodes in the binomial tree can have two parents). If you wanted you could create a generic price_option_using_tree() method in the base class that calls the methods of the derived class(es) to calculate the payoffs in each node.
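
To illustrate the array representation, here’s a minimal sketch of a CRR-style binomial pricer for a European option - the function name and parameters are mine, not your class, and it assumes NumPy:

Python code:
import numpy as np

def price_european_binomial(s0, k, r, sigma, t, n_steps, is_call=True):
    dt = t / n_steps
    u = np.exp(sigma * np.sqrt(dt))      # up factor
    d = 1.0 / u                          # down factor
    p = (np.exp(r * dt) - d) / (u - d)   # risk-neutral up probability

    # The terminal level of the 'tree' is just a 1D array: index j = number of up moves
    j = np.arange(n_steps + 1)
    stock = s0 * u**j * d**(n_steps - j)
    values = np.maximum(stock - k, 0.0) if is_call else np.maximum(k - stock, 0.0)

    # Walk backwards one level at a time, discounting expected values
    for _ in range(n_steps):
        values = np.exp(-r * dt) * (p * values[1:] + (1 - p) * values[:-1])
    return values[0]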


DoctorTristan

Hollow Talk posted:

Also, depending on your python version (anything >=3.6 I think?), I would replace the string format stuff for logging with f-strings, which are both faster and more readable.

https://www.python.org/dev/peps/pep-0498/

The logging module’s printf-style formatting can be even faster, since it won’t evaluate the message if the logging level means it’s not going to be logged. Of course this makes your logging calls a lot uglier.
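
To make that concrete - a small sketch, where the Expensive class just stands in for any object with a costly __str__:

Python code:
import logging

logging.basicConfig(level=logging.INFO)   # DEBUG messages are filtered out
logger = logging.getLogger(__name__)

class Expensive:
    def __str__(self):
        print('building expensive repr...')   # stands in for costly work
        return 'summary'

obj = Expensive()

# f-string: the interpolation (and __str__) runs before debug() is even called
logger.debug(f'state: {obj}')

# printf-style: logging defers the formatting, so __str__ never runs here
logger.debug('state: %s', obj)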

DoctorTristan

fuf posted:

I'm confident I'll get the things I need eventually, it's just a slow process!

:allears:

DoctorTristan
From what you’re describing it doesn’t sound like putting the data into a SQL database is going to help you. It might be a better option for storing/retrieving the data in future (though you should think long and hard before creating a table with 200,000 columns), but if you’ve already got the data in a pandas data frame and don’t know what to do next it is not going to help.

Some more detail would be useful. What is the structure of your data and what analysis are you trying to run on it?

DoctorTristan

Spaced God posted:

I'll try to explain in vagueries while hopefully giving the scope of the problem. I don't want anyone knowing a comedy web forum helped me be a good intern lol

I work with people who get called out from central areas all over a city to various places. If every person in one central area's territory is busy, another person from a nearby territory responds to any new calls in that saturated area. It gets more complex because sometimes multiple people get called out to a site, as well as a few other variables. We have a big business and the spreadsheet is automatically generated from the dispatching system with a ton of data including dates and times and locations of where the person is going, and that's the data source.

Currently what my code does is take the sheet (which I've done some manual excel magic on to help the process, more on that later) and put it into a pandas dataframe, do some joins to gather all the data of which person belongs where into one df, and then query every instance where a person went to a place out of their territory, dependent on a few extra variables, via an embarrassingly long conditional query. It's a very vague query and not entirely representative, and that shows because it includes like 60% of the whole table. I'm trying to figure out, for example, whether every available person was already occupied when this happens (determined via timestamp analysis), but I have no idea how that would work in pandas.

Likewise, I've been trying to find a way to help sanitize or validate data conditionally, too. Ideally I'd run a for loop through pandas to look at our in-house codes for each territory on a row-by-row basis and strip useful info from them (they're six-character codes, with each pair representing something about the territory in decreasing scale, i.e. city, neighborhood, street). That stuff I've mostly been doing in excel manually, which sucks but works, but ideally I want that to be automated.

Hopefully that explains what I'm doing? Monday mornings are never my strong point

I can only give vague advice on this kind of vague info, but it sounds like you are trying to do something quite difficult and have a moderate-to-severe case of running before you can walk.

A few things stand out:

When you say that your data is 60*200000, did you get that the wrong way round or do you really have 60 rows and 200,000 columns?

What does each row in your data frame represent? Is it an event where someone is called out, with date/time/location?

Can you do more basic queries, such as ‘how many call-outs were there on June 19 2019?’ or ‘how many call-outs were there in total in region X during 2018?’ Can you do visualisations of these?

It sounds like your extraction, data cleaning and analysis steps are getting quite mixed up with each other - this is invariably a bad idea.

DoctorTristan

Spaced God posted:

Yeah this is definitely being popped out of the womb and being told to fly a spaceship. A lot of my time has been spent on stackexchange or deep in the confines of the pandas reference pages.

To answer your question, 6AM me got it backwards, it's 200,000 long with 60 columns. Each row represents one call, and columns provide lat/long of the address they were going to, time the phone was picked up, time the unit was sent, time the unit got there, time the unit left, if the unit was canceled, in what priority order they arrived, and a bunch of other things. The dataframe represents the entirety of one year, as they want an analysis done on a yearly basis.

The output ideally should be "during this year, here are the locations that had the most of these saturated incidents" and would most likely be tabular because my main gig is geospatial analysis so I can super easily just plunk that into ArcGIS and run with it.

Thankfully, my internship folks prefaced giving me this project with "we definitely can't do it, and you don't have access to the VPN during quarantine, so maybe you might be able to try?" and they're optimistic about a final product but understand if I don't get too much done, what with not really knowing Python beyond the basics

It sounds like if you can identify (from lat/long co-ordinates) when a call was out of area then you get a decent chunk of the way there? You could add an extra column ‘is_out_of_area’ to the dataframe and then reuse it in later queries - something like the sketch below.
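
A minimal illustration of that flag-column idea (all column names invented, since I don’t know your real export):

Python code:
import pandas as pd

# Toy stand-in for the dispatch export
df = pd.DataFrame({
    'call_id': [1, 2, 3],
    'unit_territory': ['A', 'A', 'B'],
    'call_territory': ['A', 'B', 'B'],
})

# Compute the flag once, then reuse it in every later query
df['is_out_of_area'] = df['unit_territory'] != df['call_territory']

print(df[df['is_out_of_area']])                               # the out-of-area calls
print(df.groupby('call_territory')['is_out_of_area'].sum())   # count per area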

One additional bit of advice I’d give is: don’t try to do too much in a single step, and don’t spend too long scouring the pandas docs hoping to find the one function that does exactly what you want. There’s no shame in just iterating over the dataframe row by row to do calculations, particularly if you’re new to both pandas and Python.

DoctorTristan
And even if you could, so could the person sending totally_the_file_you_wanted_and_not_a_cryptolocker_check_the_hash_if_you_dont_believe_me.exe

DoctorTristan
Phone posting so can’t post sample code, but if the only part of the file you want is:

code:
[Load of crap and commas I don’t care about]
“Total for functional location: [I want everything before the closing quote]”
,,,,[shitload more commas],,,,
Total: [I want this number too]
[More commas and crap I don’t care about]
Then I would probably just load the file as a string and use a regex to extract the relevant parts. The csv reading options mentioned above all assume the file contains data in a (roughly) columnar format where each row contains repeated instances of the same few columns, and this file... isn’t that.
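
Something along these lines (the file name and exact patterns are illustrative):

Python code:
import re

with open('report.csv', encoding='utf-8') as f:
    text = f.read()

# Everything inside the quotes after 'Total for functional location:'
location = re.search(r'"Total for functional location:\s*([^"]*)"', text)

# The number on the line that starts with 'Total:'
total = re.search(r'^Total:\s*([-\d.,]+)', text, flags=re.MULTILINE)

if location and total:
    print(location.group(1), total.group(1))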

DoctorTristan

QuarkJets posted:

OP just needs to use pySFDSF

Take screenshots of the spreadsheet and train a neural net.

DoctorTristan

my bitter bi rival posted:

Random question: I am working through Eric Matthes Python Crash Course right now and I've gotten a lot out of it. I'm currently going through the part about importing data from a csv and plotting it with matplotlib.

Here's a code sample, an example from the book:

code:
import csv
from datetime import datetime

import matplotlib.pyplot as plt

file = 'data/sitka_weather_07-2018_simple.csv'

with open(file) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    dates, highs = [], []
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        dates.append(current_date)
        highs.append(high)

plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates, highs, c='red')
I'm wondering, is there a reason to put the data into two separate lists instead of a dictionary? It seems like the data is better organized this way:

code:
with open(file) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    data = {}
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        data[current_date] = high

plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(data.keys(), data.values(), c='red')
Since this is about downloading data and data visualization, is it a better practice to keep things in lists instead of dicts for some reason? Or is this more or less completely unimportant?

Using two lists has some downsides, but using a dict like this is an absolutely loving terrible idea for a number of reasons. I’ll limit myself to pointing out two.

Firstly, and most obviously, if there are multiple observations on the same date then you will lose data, since only the last such observation will be retained.

Secondly, pretty much every operation you might want to do on this data is significantly easier if it is in lists. For example, calculating the first differences in the temperature highs is as trivial as
code:
[b - a for a, b in zip(highs, highs[1:])]
for a list, but is a pain in the arse if the data is in a dict. Plotting one against the other is similarly easier if both are lists.

This second point touches on the related issue that putting the data into a dict like this demotes the ordering. The source data file is ordered - there is a first row, a second row, etc. Dicts in modern Python do preserve insertion order (guaranteed since 3.7), but a date-keyed dict treats that order as incidental rather than structural, and most things you do with it won’t respect it (in this example the ordering may be recoverable if the date stamps are sorted with no duplicates, but that’s a very big if).

DoctorTristan
Profile it and find out what’s taking the most time, then optimise that part. Repeat until performance is acceptable.

If you want more specific guidance you’ll need to post code and profiler output.
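
If you haven’t profiled before, the standard library will get you started - main() here is a stand-in for your actual entry point:

Python code:
import cProfile
import pstats

cProfile.run('main()', 'prof.out')   # profile one full run of the program
stats = pstats.Stats('prof.out')
stats.sort_stats('cumulative').print_stats(10)   # the ten hottest call paths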

DoctorTristan
Do you want me to give you a hint to point you in the right direction, or to solve the whole thing for you?

Assuming the former: you don’t need pandas for that. Just put the cipher text into a string and read the docs on string operations and python’s slicing syntax.

DoctorTristan

Famethrowa posted:

My code currently is attempting that, but I'm having issues with traversal. I need to slice, say, every third letter, and then once I hit the end of the string, loop back to the beginning of the string to finish the count and reslice. I could perhaps just duplicate the string many times over to achieve that, but that feels clunky as hell.

My reasoning for pandas is that the manual way of decoding the cipher involves a grid, and a column/row format would serve that purpose.

e. I've been messing with while loops and am just not quite there yet.

You said you didn't want hand-holding, but you should use the slicing notation
code:
s[start:stop:step]

I'm not the one marking your work, but if I were I would deduct marks if you used pandas for this - it really isn't the right tool for the job.

I'd give a bonus point or two if you managed to solve it in one line.

DoctorTristan
You shouldn’t use numpy either.

QuarkJets posted:

I feel like there may be a clever list comprehension that could give a fast, succinct solution

Not posting it since OP wants to figure things out themselves, but there’s a one-line solution.

DoctorTristan

Zoracle Zed posted:

Anyone else noted how awful it is googling for anything python-related these days? SEO means the first couple pages of results are all geeksforgeeks.com and other awful beginner tutorial spam. (Which, like, even for beginners, seem bad.)

Anyway, here's my actual question: in matplotlib.pyplot.plot there's this concept of a format string, so 'r--' means a red dashed line, etc. Somewhere in matplotlib, I assume, there has to be a format string parser that takes in 'r--' and spits out {color: 'r', linestyle: '--'} or whatever. Point me in the right direction, please? Whenever I write convenience plotting methods I end up fighting to balance the convenience of the very-compact format string style and actual structured argument handling.

Do you want to know the kwargs alternative to the format string, or the location of the parser module? If the latter, why not just call it with an invalid format string and see where the exception gets thrown from?
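
Something like this should surface the parser in the traceback (the internal function’s name varies between matplotlib versions, so treat this as a probe rather than an API):

Python code:
import traceback

import matplotlib.pyplot as plt

try:
    plt.plot([0, 1], [0, 1], 'not-a-format')
except ValueError:
    traceback.print_exc()   # the deepest frames name the format-string parser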

DoctorTristan
Have I misunderstood pip’s version specifiers, or is it doing something weird here?

If I run

code:
pip install "msal>=1.9.0"
I thought this meant ‘install the most suitable version of msal at or above 1.9.0’? (The quotes matter, by the way - without them the shell treats >= as a redirect.) Instead it downloads four versions of msal and spits out an Error: ResolutionImpossible with the following info:

code:
The conflict is caused by:
msal 1.12.0 depends on cryptography <4 and >= 0.6
msal 1.11.0 depends on cryptography <4 and >= 0.6
msal 1.10.0 depends on cryptography <4 and >= 0.6 
msal 1.9.0 depends on cryptography <4 and >= 0.6
I don’t have the cryptography package installed, so can’t see any conflict here? Unless it’s trying to install all four versions and finding a conflict?

(I originally found this problem with a requirements.txt file that had a bunch of >= dependencies; also, this is installing from an Azure Artifacts feed that mirrors PyPI, if that makes any difference)

DoctorTristan

OnceIWasAnOstrich posted:

No, you are using it right. Something in the dependency resolution it is doing seems to be calling for cryptography. I would normally say that something else you have installed or are installing has a conflict with the cryptography version number. What version of Python are you using? Maybe there are no cryptography packages for your Python in that range or the repo/your environment is busted.

You can also try pip installing a version of the package in the appropriate version range and see what it does. Now that I think about it if there was an actual conflicting dependency pip would normally list it in that same output.
Thanks.

Should have thought of doing that myself, but tried it and found that the underlying problem is that there’s no cryptography package on this piece of poo poo azdo mirror of ours, so none of the packages can install.

No idea why pip wouldn’t just say that instead of bullshitting about conflicts, but hey ho.

In conclusion: gently caress computers

DoctorTristan

Hed posted:

Isn’t cryptography the one that switched to linking against rusttls recently?

May not be a big deal depending on your environment but something to keep in mind.

Also, agreed re: computers.

Possibly true - I don’t know much about it other than that msal depends on it (msal is the Microsoft authentication library - my employer uses OAuth all over the shop so I have to use this package a lot)

I also found that if I try to do this with two packages with missing dependencies I can send pip into an infinite loop, which is really stupid.

DoctorTristan
Does DataFrame.interpolate() do what you need?
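
For reference, a tiny example of the default (linear) behaviour:

Python code:
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
print(s.interpolate())   # fills the gap linearly: 1.0, 2.0, 3.0, 4.0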

DoctorTristan
If you’re doing that you might as well write a new context manager that deletes the tempfile on completion. Phone posting so can’t provide a snippet, but could post one later if no one else obliges in the meantime…

DoctorTristan

12 rats tied together posted:

the context manager is pretty easy and is a great python feature, you basically just define __enter__ and __exit__ methods

There's a simpler syntax using the @contextmanager decorator (it's been in the standard library since Python 2.5):

Python code:
from contextlib import contextmanager
import os

@contextmanager
def managed_tempfile(path):
    fh = open(path, 'w+')
    try:
        yield fh              # hand the open file to the with-block
    finally:
        fh.close()            # runs even if the block raises
        os.remove(path)       # delete the file on the way out
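
Usage then looks like:

Python code:
with managed_tempfile('scratch.txt') as fh:
    fh.write('intermediate results')
# the file is closed and deleted here, even if the block raised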

DoctorTristan

D34THROW posted:

it was just passing line_data...which was passing it a False for some reason, instead of None or whatever

This means that the name line_data has been defined as False somewhere in your code. It sounds like you didn’t (deliberately) do that, but are you importing any modules using the pattern from foo import *?

DoctorTristan
Do you want the first billion integers in a random order? numpy.random.permutation(n) will give you a random permutation of the integers up to n, but I have never tested it on a billion integers.
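
A quick sketch, with the caveat that a billion int64s is roughly 8 GB before you even start shuffling:

Python code:
import numpy as np

rng = np.random.default_rng()
perm = rng.permutation(1_000_000)   # the integers 0..999_999 in random order
print(perm[:10])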

DoctorTristan

Epsilon Plus posted:

What I want is to get a list that sometimes starts [1, 3, 4, 7, 9...] and other times starts [1, 2, 3, 5, 8...] or maybe [2, 4, 5, 9, 12...]


If that’s all you want then just generate a sequence of random integers between 1 and 5 (or whatever you want the maximum difference to be), then take the cumulative sum of that sequence (np.cumsum will do the job).
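
For example (the bounds are illustrative):

Python code:
import numpy as np

rng = np.random.default_rng()
steps = rng.integers(1, 6, size=10)   # differences of 1..5 (high end exclusive)
seq = np.cumsum(steps)
print(seq)   # strictly increasing, with random gaps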

DoctorTristan

AfricanBootyShine posted:

I have what I think is a really simple question with numPy. I finally have a job where I can actually use it for work, but it's been years since I did any real python work so I'm a bit lost.

I have a csv that contains readings for a bunch of samples at different wavelengths. I've pasted an example portion of it below. Normally it'll go all the way down to 300 nm. But I've trimmed it for everyone's sanity.

code:
Baseline 100%T,,SampleOx,,SampleOx1,,SampleOx2,,SampleRed,,SampleRed1,,SampleRed2,
Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs
700,2.521076918,700,0.051371451,700,0.020255247,700,-0.000277047,700,-0.013994155,700,-0.040811472,700,-0.046730809
699,2.515768766,699,0.056336451,699,0.021696234,699,0.002584572,699,-0.014951141,699,-0.038384374,699,-0.042782523
698,2.51525569,698,0.054913107,698,0.020626975,698,0.005365098,698,-0.013208756,698,-0.039243225,698,-0.044276398
697,2.517320871,697,0.051321168,697,0.018043108,697,-0.001523819,697,-0.01844346,697,-0.039591964,697,-0.044799961
696,2.516803503,696,0.048457876,696,0.016133199,696,-0.003205611,696,-0.019673269,696,-0.042768963,696,-0.048874158
The first row holds the sample names, which are spaced out because each sample contributes two columns: a wavelength and an absorbance reading. I want to make a 3D array so that I can easily pick the absorbance at 400 nm for SampleRed. What's the easiest way to feed this into a 3D array, but still retain info like the sample names and the wavelengths?

I want to build this to be extensible, as I will be taking readings using this system for the next few years.

As posters above have commented, this is easy enough to turn into a pandas DataFrame via the pandas.read_csv() function. This is almost certainly what you actually want to do - a NumPy 3D array is going to be a lot more awkward to retrieve the correct data from.

Your data structure does look quite odd, though. Is there any reason you've arranged things as

code:
|    Baseline 100%T     |       SampleOx        |      SampleOx1        |   ...
| Wavelength (nm) | Abs | Wavelength (nm) | Abs | Wavelength (nm) | Abs |   ...
|       700       |     |       700       |     |       700       |     |   ...
|       699       |     |       699       |     |       699       |     |   ...
|       698       |     |       698       |     |       698       |     |   ...

All of the wavelength entries in each row seem identical, so wouldn't it make much more sense (and be much easier to work with) to arrange the data like this?

code:
| Wavelength (nm) | Baseline 100%T Abs | SampleOx Abs | SampleOx1 Abs | ...
|       700       |                    |              |               | ...
|       699       |                    |              |               | ...
|       698       |                    |              |               | ...
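
If you’re stuck with the current layout, read_csv can still cope if you treat the two header rows as a MultiIndex. A rough sketch (filename invented; note that the blank top-level header cells come back as ‘Unnamed: …’ and need forward-filling):

Python code:
import pandas as pd

df = pd.read_csv('spectra.csv', header=[0, 1])

# Forward-fill the sample names across each (Wavelength, Abs) pair of columns
top = pd.Series(df.columns.get_level_values(0))
top = top.where(~top.str.startswith('Unnamed'), pd.NA).ffill()
df.columns = pd.MultiIndex.from_arrays([top, df.columns.get_level_values(1)])

# Absorbance for SampleRed at 400 nm
red = df['SampleRed']
print(red.loc[red['Wavelength (nm)'] == 400, 'Abs'])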

DoctorTristan
code:
punctuation = list(range(31, 65)) + list(range(91, 97))

DoctorTristan
setUp() and tearDown() are the recommended ways to share (de-)initialisation stages between tests in unittest, but I agree with just using pytest - apart from anything else it requires less typing (and it can run unittest tests, so there's no need to rewrite existing ones).
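
The pytest equivalent is a fixture; a minimal sketch, where connect_to_test_db is a hypothetical setup helper:

Python code:
import pytest

@pytest.fixture
def db_connection():
    conn = connect_to_test_db()   # hypothetical setup, runs before each test
    yield conn                    # the test receives this value
    conn.close()                  # teardown, runs after the test finishes

def test_query(db_connection):
    assert db_connection.execute('SELECT 1') is not None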

DoctorTristan

D34THROW posted:

"Why isn't this table getting populated? The data is formed well, the vars() of it looks good, everything is populated...lemme go look at the function."

Oh.

That's why.

No return statement to spit the prettified data back :doh:


On another note, I'm quite pleased with myself for realizing I can pass the error code and message as parameters to a boilerplate error.html page instead of a separate page for each message.

Use type hints. Any decent editor will warn you if a function hinted as returning a value does not return one (and might also warn about returning an object of the wrong type, depending on how complex the definition is).
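
The shape of the bug above, with invented names:

Python code:
def prettify(data: dict) -> str:
    formatted = ', '.join(f'{k}={v}' for k, v in data.items())
    # Forgetting this return is exactly the bug described above - with the
    # -> str hint, an editor or type checker flags the fall-through immediately
    return formatted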

DoctorTristan

Dawncloack posted:

I have to backfill information into an SQL database, and I am writing a python script for it.

I depend on some other piece of software: I send it unix timestamps (and other info) and then get the info to backfill from it.
This program offers me two options: to give it an old date and get the data from that date, or to ask for real-time information and get this exact moment's data.

I have a problem tho: The real time information has two decimals of precision that the old info does not.

What would be the easiest way to change my system's time a buncha times so that, combined with the other info, I get the better precision ?

Thanks!

edit: I am thinking it might be easier to change the information I feed the program and pretend it's all from right now.

I'm having a hard time believing that the data retrieval works in the way you describe, or that what you're attempting to do would even work, let alone whether it's a good idea.

Before you go creating your very own entry for the coding horrors thread, a few questions.

From what you've described it sounds like the system offers two interfaces to the data - get_historical_data(date) and get_realtime_data() (or similar). To me that screams that the system contains two data storage components - a real-time stream or queue and a data store populated from that stream/queue - and that get_realtime_data() simply retrieves the latest data from the stream. If that's the case then changing your system time won't get you the data you want (in addition to being horrific behaviour in and of itself).

Similarly, the most likely (to me) explanation for the difference in precision between the two methods is that the system has an archive data store that was improperly set up and the data gets truncated on storage. If that's the case there's nothing you can do to recover the lost precision.

Of course it's possible that get_realtime_data() is for some extremely hosed up reason getting your system time and using that to query the archive and that the truncation in the other method is happening in the interface instead of the actual storage, but all of that would require some truly galaxy-brained programming from whoever implemented the interface - so much so that I would seriously question whether any of the interfaces were even retrieving the correct data in the first place.

So have you verified that changing the system time in the call to get_realtime_data() actually gets you the data you want, and not (eg) just the latest data with the timestamp changed?

DoctorTristan

Mycroft Holmes posted:

I've run into a problem with my homework. I have to take the following list:

code:
[{'rank': 1, 'title': 'Pride and Prejudice', 'author': 'Jane Austen', 'year': 1813}, {'rank': 2, 'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'year': 1960}, {'rank': 3, 'title': 'The Great Gatsby', 'author': 'F. Scott Fitzgerald', 'year': 1925}, {'rank': 4, 'title': 'One Hundred Years of Solitude', 'author': 'Gabriel Garcia Marquez', 'year': 1967}, {'rank': 5, 'title': 'In Cold Blood', 'author': 'Truman Capote', 'year': 1965}, {'rank': 6, 'title': 'Wide Sargasso Sea', 'author': 'Jean Rhys"', 'year': 1966}, {'rank': 7, 'title': 'Brave New World', 'author': 'Aldous Huxley', 'year': 1932}, {'rank': 8, 'title': 'I Capture The Castle', 'author': 'Dodie Smith', 'year': 1948}, {'rank': 9, 'title': 'Jane Eyre,', 'author': 'Charlotte Bronte', 'year': 1847}, {'rank': 10, 'title': 'Crime and Punishment', 'author': 'Fyodor Dostoevsky', 'year': 1866}]
and my instructions are:

I have no idea how to do this. I have spent 20 minutes trying to use .split to do it, before realizing it won't work. Any suggestions?

There are some *very* compact ways to do this in python that it sounds like you haven't met yet (and the question hint is suggesting you not use). As this is a homework question I'll just give a couple of hints for now.

To populate book_data you'll need to iterate over each entry in the current list, transform it into the specified format, then append that transformed value to book_data.

To do the transformation you should look at either f-strings (if you've met them), or the .join() method.


Edit: okay, from your answers above sounds like we need to take a step back. Look at the first element of the list you've been given. What type of object is it? How would you get the title or author from that object?


DoctorTristan

punk rebel ecks posted:

You lost me here. The tutorial has no "requirements.txt".

Also how do I run my python file from the virtual environment? Navigate to its folder via the terminal?

I wish there was a way I could do all this from VSC. :(

Whoever wrote the tutorial might have skipped creating a requirements.txt because the project doesn’t have many dependencies. Try just running pip install flask inside the virtual env - that may be all you need.

To run a python file in a virtual environment, you launch a terminal, activate the environment in that terminal (as described by posters above), then run the file as you normally would - e.g. python path\to\file.py on Windows (note that python -m expects a module name, not a file path).

You absolutely can use virtual environments within VSCode - it wouldn’t be much use as a python dev environment if you couldn’t. Do ctrl-shift-P and search for the Python: Select Interpreter command - that will allow you to select the virtual environment. Now whenever you create a new terminal window within VSCode it will automatically activate the virtual environment in that window, so code running in that window will run in the virtual environment. Alternatively if you created the virtual environment within the workspace folder (which is what I do), VSCode will automatically detect it and ask if you want to use it.
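
For reference, the whole round trip in a terminal looks something like this (app.py stands in for the tutorial's entry point):

code:
python -m venv .venv
.venv\Scripts\activate      # Windows; use 'source .venv/bin/activate' on macOS/Linux
pip install flask
python app.py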

DoctorTristan

QuarkJets posted:

Each line of a CSV is usually going to have a newline character at the end of it so it's a pretty safe assumption, yeah. I'm sure there are madmen out there that use some other character for some reason but that's unusual


This is not a safe assumption at all for the last (data) line in the file - it’s only true there if the creator of the file ended it with a trailing newline (which you’re supposed to do, but people frequently don’t bother).

DoctorTristan

QuarkJets posted:

I'm talking about the last character of every line, not the last line of every file

Let me rephrase.

If the file ends with a trailing newline then every line, including the last, ends with a newline character.

If it does not, then the last line has no newline character at the end, so the assumption is false.

DoctorTristan
Why are you using a regex instead of just slicing each input line??

Edit to be a little more helpful: if you know the widths of the columns, then I’d just call readline() in a loop to get each line out as a string and slice each string using the known fixed column widths. If you are already using pandas/don’t mind using it then the read_fwf() function will do that for you.

If you don’t know the column widths and are looking for a ‘smart’ library that can infer the column widths from the file itself then afaik you’re SOL.
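
A sketch of both routes, with an invented file name and column widths:

Python code:
import pandas as pd

# pandas route: read_fwf slices the fixed-width columns for you
df = pd.read_fwf('report.txt', widths=[10, 8, 12], names=['id', 'date', 'amount'])

# plain-python route: slice each line at the known offsets
with open('report.txt') as f:
    for line in f:
        record = (line[0:10].strip(), line[10:18].strip(), line[18:30].strip())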


DoctorTristan

Falcon2001 posted:

I look forward to 14.000000000001 pages about floating point number formatting weirdness.

DoctorTristan
You’re partitioning the set into the equivalence classes generated by the relation ‘the timestamps overlap’ (strictly, by its transitive closure, since overlapping isn’t itself transitive) - I don’t believe there’s any faster way to do this than the naïve brute force method.

Since it looks like you’re doing it on a pandas dataframe it would probably be simpler to use a new column to keep track of the equivalence classes - phone posting so I’ll have to do it in pseudo code but the idea is:

* create a new integer column of zeros, called ‘equivc’ or whatever
* start with the first row, set .loc[0, ‘equivc’] = 1
* find every row that overlaps with a row having equivc == 1, set equivc=1 for those rows
* repeat until you don’t find any more rows
* now find the first row that still has equivc==0, set equivc=2 for that row, and repeat the above steps
* keep going until there are no rows left with equivc == 0

Once you’re done with this then every row with equivc==1 has mutually overlapping intervals, as does every row with equivc ==2 and so on.
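
A rough translation of those steps into actual pandas, with invented column names:

Python code:
import pandas as pd

df = pd.DataFrame({'start': [0, 5, 30, 90, 95], 'end': [10, 20, 40, 100, 98]})

df['equivc'] = 0
label = 0
while (df['equivc'] == 0).any():
    label += 1
    # seed the next class with the first unassigned row
    df.loc[df.index[df['equivc'] == 0][0], 'equivc'] = label
    # grow the class until no new overlapping rows turn up
    while True:
        members = df[df['equivc'] == label]
        hits = (df['equivc'] == 0) & df.apply(
            lambda row: ((row['start'] <= members['end'])
                         & (members['start'] <= row['end'])).any(),
            axis=1,
        )
        if not hits.any():
            break
        df.loc[hits, 'equivc'] = label
print(df)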

DoctorTristan
I don’t have any specific advice on pyinstaller, but I will comment that ‘a bunch of people running a .exe I pass around’ is a solution that may be okay in the short term but absolutely will come back to bite you sooner or later (exactly how quickly depends a bit on how big the organisation is and how quickly requirements change).

Hard to give detailed advice on what to do instead without more details on what you’re doing, but it does sound like what you *really* need is a database and some proper ETL tools.

DoctorTristan

Jose Cuervo posted:

I am looking through free text strings (short hospitalization reasons) for the word 'parathyroidectomy'. I am able to do simple string matching (e.g., looking for 'para' in example_string), but this type of searching assumes that the word has been spelled correctly and will not catch paarthyroidectomy, even though that would be a relevant result. Is there any library which would help me search these strings for misspelled matches?

Fuzzywuzzy
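
i.e. fuzzy string matching. Something like this (thefuzz is the maintained fork of fuzzywuzzy):

Python code:
# pip install thefuzz
from thefuzz import fuzz

text = 'pt admitted for paarthyroidectomy'
score = fuzz.partial_ratio('parathyroidectomy', text)
print(score)   # high despite the transposition; match on a cutoff like 85+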

DoctorTristan

Josh Lyman posted:

It says "KeyError: 2". I'll remove that as an argument and see if that helps. :shrug:

Of the last 20,000 or so mydf_in's, the KeyError occurred about 25 times.

edit: Modified the code to remove the function call and put the statements directly in the script and cleaned it up a bit, should help with debugging if the KeyErrors still happen.

Reducing encapsulation is rarely going to help you and I really recommend you don’t do that. I’d say about 40% of the weird python errors I’ve had to help coworkers with were caused by a name collision in a huge monolithic script, where breaking it up into smaller functions would have either prevented it entirely or made the error much more obvious.

It’s a bit inelegant, but have you tried catching the KeyError and setting a breakpoint inside the except block? That should let you inspect the variables at the point of the error and get a better idea of what’s going on.

Also, what’s your source for this data? Your comment about how three notebooks simultaneously hit an error kind of makes me suspect that the issue might originate with your data source (eg a database connection dropped, or JWT token expired) but is not getting caught until later.
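
Something like this, wrapped around whatever lookup is failing (mydf_in and key are from your post; the rest is illustrative):

Python code:
try:
    row = mydf_in.loc[key]
except KeyError:
    import pdb; pdb.set_trace()   # poke at key, mydf_in.index, etc.
    raise                         # re-raise so the failure isn't swallowed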


DoctorTristan

Deadite posted:

Can anyone help me understand why this example ends in an error:

code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test = test.fillna('Missing')
test['a'] = np.where((test['a'] == 'Missing'), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test
but this works as expected?

code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test['a'] = np.where((test['a'].isna()), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test
I would think that both where statements would only return the population that doesn't satisfy the first condition to the second where statement, so the first code shouldn't be converting any strings to float if the second code isn't, but it doesn't seem to work that way

In the first example you’re trying to call astype('float') on the string ‘Missing’. np.where isn’t lazy: all three of its arguments are evaluated in full before any selection happens, so the inner astype('float') runs over the entire column - including the ‘Missing’ rows that the outer condition would have excluded from the result.
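
You can see the eager evaluation in a two-liner:

Python code:
import numpy as np

a = np.array([0.0, 1.0])
# Both branches are computed before the selection happens, so the
# divide-by-zero warning fires even though np.where never picks that element
print(np.where(a > 0, 1 / a, 0))   # RuntimeWarning: divide by zero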

  • Reply