|
unpacked robinhood posted:I have a small percentage of files out of a batch that don't parse well when opened with the default open(..) No, that’s exactly what it’s for.
|
# ? Jul 24, 2019 13:24 |
|
Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?)
|
# ? Jul 24, 2019 21:21 |
|
the yeti posted:Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?) I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented. Is there a reason you can't call out to a binary for it?
|
# ? Jul 25, 2019 02:11 |
|
OnceIWasAnOstrich posted:I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented. Oh yeah, I’m almost certainly gonna use 7zip with command-line options; I was asking more for my own education
|
# ? Jul 25, 2019 02:54 |
|
Are there any websites that let me upload locally built packages (or upload via CI/CD) so that I can pip install them ~from the cloud~ ideally with a password and as cheaply as possible?
|
# ? Jul 25, 2019 13:47 |
|
Boris Galerkin posted:Are there any websites that let me upload locally built packages (or upload via CI/CD) so that I can pip install them ~from the cloud~ ideally with a password and as cheaply as possible? This seems to be an official solution to do it yourself: https://pypi.org/project/pypiserver/ Toss it on a spare PC, Raspberry Pi, VM, VPS, whatever and set up your clients to look there.
|
# ? Jul 25, 2019 17:07 |
|
Found a gotcha with dataclasses: I've broken my model into objects based on very small dataclasses (key/value, value/date, etc. type objects), and these are frozen so I can use them as keys for other dataclass objects. Anyway, you can probably guess what happens when you do this and then call asdict() on it: TypeError: unhashable type: 'dict' A frozen dataclass with 'compare=True' set on a field makes a great dictionary key and allowed me to flatten my previously very nested data structure and remove some redundant data, but I'll have to roll my own dataclass-to-JSON now.
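A minimal reproduction of the gotcha described above (the class names here are made up for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Key:
    name: str
    value: int

@dataclass
class Model:
    table: dict

m = Model(table={Key("a", 1): "first"})
# asdict() recurses into containers and converts the frozen Key *keys*
# into plain dicts as well -- and a dict can't be a dictionary key.
try:
    asdict(m)
except TypeError as exc:
    print(exc)  # unhashable type: 'dict'
```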
|
# ? Jul 25, 2019 19:33 |
|
Hey bros, have another packaging Q. I feel like the only way to ensure a conflict-free dependency graph is to allow multiple versions of a dep, should there be a conflict. (eg dep A requires version 2.0 exactly of subdep C, and dep B requires version 2.1) From what I understand, you can only do this if you rename one (its folder name, and in the imports). Can y'all think of a clever way around this? Maybe retroactively dive into the directories, and rename both the folders and the imports if the need arises?? I think this should be doable, and transparent, since it's only an issue for deps at least one nest-level below what the user's doing. I need to look into how Pip/Pipenv/Poetry do it, but I think Pipenv refuses to continue if there's a conflict. Would take some clever file/folder wrangling, parsing files for import statements, etc. Might be a PITA. Can y'all think of a better way? Dominoes fucked around with this message at 14:38 on Jul 26, 2019 |
# ? Jul 26, 2019 14:36 |
|
Goddamn python, people have been dealing with this forever and you still haven't fixed your poo poo. The folder-name mangling is the easy part. The hard part is automatically modifying imports to account for the new package name. There's a ton of different ways to import a module (many of them dynamic), and reliably accounting for them all is a lot of work that many people have started and given up on. Probably the very easiest wrinkle to handle: Python code:
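The code block in the original post was lost; as a hypothetical stand-in, even the simplest dynamic import illustrates the problem, because the module name is just a runtime string that a source-level renamer will never see:

```python
import importlib

# The name is computed at runtime -- it could just as well come from a
# config file -- so a tool that only rewrites `import foo` statements
# in source files cannot know it needs renaming.
name = "js" + "on"
mod = importlib.import_module(name)
print(mod.dumps({"ok": True}))  # {"ok": true}
```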
|
# ? Jul 26, 2019 21:29 |
|
Does anyone have a pointer to some reasonably thorough benchmarks for evaluating performance when running typical data science/ML tasks? I'm currently looking at https://github.com/numpy/numpy/tree/master/benchmarks, but maybe one of you ML expert types has something even better. The source of the request is that I'm working on some functional tests for our ML people so that we can establish a known baseline for the performance of their Jupyter notebooks, then start switching things out and quantifying the performance difference (OpenBLAS vs MKL, that sort of stuff). Turns out none of them have an existing set of comprehensive benchmarks for this purpose, and I'm doing a little research/questioning before we start writing our own or packaging up those Numpy benchmarks for our testing.
|
# ? Jul 27, 2019 15:41 |
It’s excruciatingly difficult to define a typical data science or machine learning task clearly enough to enable construction of generalised performance benchmarks, so I feel you’ll be best served by writing your own.
|
|
# ? Jul 27, 2019 16:01 |
|
chutwig posted:Does anyone have a pointer to some reasonably thorough benchmarks for evaluating performance when running typical data science/ML tasks? I'm currently looking at https://github.com/numpy/numpy/tree/master/benchmarks, but maybe one of you ML expert types has something even better. https://mlperf.org/
|
# ? Jul 27, 2019 16:08 |
|
I have a weird python question. I've been having weird internet outages, so I whipped up a little program that pings a given URL once every second, then logs any errors with a timestamp to a file. For the purposes of having a tiny visual to go along with it, I have it print a '.' for successes and a '!' for failures to the console. This works fine in, say, the VSCode console, but it doesn't update at all when I run it in the Windows command line or PowerShell through python. When I had entire sentences for success/failure, it worked fine. If I make it log a newline after every character, it also works fine (but does seem to only update every two characters?). Googling led me to some weird regedit discussions on Stack Overflow, but it didn't result in anything. For example: Not working: code:
code:
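The two snippets didn't survive in the post; the difference was presumably just the newline, something like this sketch (the success/failure values are hardcoded stand-ins for the real ping check):

```python
import time

def status_char(ok: bool) -> str:
    return "." if ok else "!"

# Not working in cmd.exe/PowerShell: end="" suppresses the newline,
# so the characters sit in stdout's buffer instead of appearing.
for ok in (True, True, False, True):
    print(status_char(ok), end="")
    time.sleep(0.01)
print()

# Working: print's default newline flushes line-buffered stdout each call.
for ok in (True, True, False, True):
    print(status_char(ok))
```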
|
# ? Jul 28, 2019 07:24 |
|
Falcon2001 posted:I have a weird python question. Does it work if you use other characters, such as 'a' and 'b'?
|
# ? Jul 28, 2019 07:45 |
|
QuarkJets posted:Does it work if you use other characters, such as 'a' and 'b'? No, doesn't change it.
|
# ? Jul 28, 2019 08:04 |
I’d recommend hacking tqdm like so for this use case.
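The snippet itself was lost in posting; a guess at the kind of hack meant, using tqdm's public API to get a live counter rather than raw characters (the failure pattern is a stand-in for the real ping check):

```python
from tqdm import tqdm

# One tick per ping attempt; the postfix keeps a running failure count,
# and tqdm handles its own flushing, sidestepping the console issue.
failures = 0
with tqdm(total=10, desc="pinging", unit="ping") as bar:
    for i in range(10):
        ok = i % 4 != 0  # stand-in for the real ping result
        if not ok:
            failures += 1
        bar.set_postfix(failures=failures)
        bar.update(1)
```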
|
|
# ? Jul 28, 2019 08:26 |
|
cinci zoo sniper posted:I’d recommend hacking tqdm like so for this use case. Hey cool, I'll check it out! For the main problem, mostly I'm just wondering if there's some weird python rule I'm not aware of (or alternately, weird console interaction I'm not aware of).
|
# ? Jul 28, 2019 08:28 |
Probably just a difference in display modes between the consoles - it’s so inconsistent in my experience that I’ve just given up and slam tqdm everywhere I can.
|
|
# ? Jul 28, 2019 08:53 |
|
You're writing to stdout, and it's line-buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline. Use print(whatever, flush=True)
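Applied to the ping logger from earlier, the fix is one keyword argument:

```python
def report(ok: bool) -> None:
    # flush=True pushes the single character out immediately instead of
    # waiting for a newline or for the buffer to fill.
    print("." if ok else "!", end="", flush=True)

report(True)
report(False)
print()
```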
|
# ? Jul 28, 2019 08:57 |
|
Yeah, anytime you have weird issues with stuff not updating in the console like you think it should, start looking for the flush options.
|
# ? Jul 28, 2019 17:29 |
|
Thermopyle posted:Yeah, anytime you have weird issues with stuff not updating in the console like you think it should start looking for the flush options. Foxfire_ posted:You're writing to stdout and it's line buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline. Ah, thanks! That totally fixes it.
|
# ? Jul 28, 2019 18:15 |
|
If you have a helpful link as an answer, that'd be great; don't feel like you need to give me a detailed, personalized answer. My Problem: I'm starting to have more and more python/selenium-based web scrapers that get very similar data from different gov websites. For example: I scrape the same data using 6 templates from 6 different government websites and format all the data to go in the same Google Sheet. Right now, I can run all of these scrapers in about 10-15 minutes locally. We're considering upping this to 50 government websites/day, which goes a bit beyond what I can run sequentially on my local machine. Instead I'd like to run them simultaneously and automatically, simply making a log so I can catch any errors. One of them will run into errors about 1 in every 4 or 5 days, but with 5X the websites it'll likely happen daily for a while. TLDR: What's the current industry standard to scrape 50+ websites per day using separate processes/workers? Is it time to figure out AWS Lambda? What's the industry-standard way to log this? CarForumPoster fucked around with this message at 21:08 on Jul 28, 2019 |
# ? Jul 28, 2019 21:06 |
|
I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like. If I was looking to make something robust, but on a budget, I'd run an instance of Redis and then use python-rq. Easy to set up, and easily distributed across as many machines as you'd like. python-rq's documentation is easy to follow.
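A rough sketch of that setup (the job-function name and URLs are made up; the enqueue half assumes `pip install rq redis` and a reachable Redis server):

```python
def scrape_site(url: str) -> str:
    # Stand-in for the real selenium scraper.
    return f"scraped {url}"

def enqueue_all(urls):
    # Workers started on any machine with `rq worker`, pointed at the
    # same Redis instance, pull these jobs off the queue; results and
    # failures are inspectable via rq's job registries.
    from redis import Redis
    from rq import Queue

    q = Queue(connection=Redis())
    for url in urls:
        q.enqueue(scrape_site, url)
```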
|
# ? Jul 28, 2019 21:37 |
Thermopyle posted:I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like. This would be how I'd approach it as well. PyPy may also give some speed improvements, but I've never used it with selenium. With python-rq you could run x number of workers on y number of machines being fed sites to scrape, all putting the results wherever you like.
|
|
# ? Jul 29, 2019 00:02 |
|
Thermopyle posted:I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like. I have what I feel like is a stupid question...should I just make a function for each template and then run it on AWS Lambda? Any real-world experience as to why I'd do that versus trying to figure out python-rq?
|
# ? Jul 29, 2019 01:47 |
|
I've done both and both are pretty easy. I don't really think there's any more "figuring out" with either of them. Lambda costs money, of course. Doing it with Redis may or may not, depending on your current infrastructure. It's not terribly uncommon to find websites blocking the IP ranges of Amazon, Digital Ocean, Microsoft, etc., to prevent automated stuff like this.
|
# ? Jul 29, 2019 15:57 |
|
Thermopyle posted:I've done both and both are pretty easy. Thanks for the encouragement. That’s a good point regarding the IP ranges. I have a hilarious amount of AWS credits, more than I can use before they expire. My current infrastructure is nothing; literally all our data is on S3 or SharePoint. There’s one switch in our office for the Ethernet jacks in the walls.
|
# ? Jul 29, 2019 17:09 |
|
Falcon2001 posted:Ah, thanks! That totally fixes it.
|
# ? Jul 29, 2019 17:16 |
I'm having some trouble working with datetime in a small app I'm creating. I'm able to convert a datetime object to a string using strftime, but when I try to convert the exact same string back into a datetime using strptime in order to search a database for it, I'm told that the formatting I'm using is not valid even though I literally copy/pasted the format from the strftime function. The functions are: new_ev.date_time.strftime("%-m/%d/%y %-I:%M %p") datetime.strptime(date_time, "%-m/%d/%y %-I:%M %p") The error I'm getting when the bottom one is called is "ValueError: '-' is a bad directive in format '%-m/%d/%y %-I:%M %p'" Can anyone point out to me why I'm having this problem?
|
|
# ? Aug 4, 2019 01:21 |
|
Support for '%-' is platform-specific: strftime calls out to your system's C library, so your platform's strftime accepts `%-I`, but Python's strptime only accepts the directives documented in the standard library, which don't include the '%-' flag. Can I ask why you're converting a datetime object to a string and back? Also, you may find this useful: http://strftime.org/
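Dropping the '-' flag makes the round trip work everywhere, at the cost of zero-padded months and hours:

```python
from datetime import datetime

FMT = "%m/%d/%y %I:%M %p"  # portable: no platform-only %- flags

dt = datetime(2001, 1, 1, 0, 0)
s = dt.strftime(FMT)
print(s)  # 01/01/01 12:00 AM
# Unlike the %- version, this parses back without a ValueError.
assert datetime.strptime(s, FMT) == dt
```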
|
# ? Aug 4, 2019 01:33 |
Solumin posted:Can I ask why you're converting a datetime object to a string and back? By "exact same string" I didn't mean the same object, just the same value (in this case since I was testing it "1/1/2001 12:00 AM"). Events are pulled from the db file to be displayed in a readable format on a table, so I use strftime to do that. It's whenever an event is created or modified that strptime is used to turn the user's input into a datetime object and verify that there's no other events at the same time.
|
|
# ? Aug 4, 2019 16:13 |
|
i vomit kittens posted:I'm having some trouble working with datetime in a small app I'm creating. I'm able to convert a datetime object to a string using strftime, but when I try to convert the exact same string back into a datetime using strptime in order to search a database for it, I'm told that the formatting I'm using is not valid even though I literally copy/pasted the format from the strftime function. Poster above me hit it. I deal with a lot of date strings gotten from a variety of web scraping and API calls. Working with datetime is a bitch in that use case. I've come around to prefer pendulum. Best part is much of the syntax is the same; it's just a little easier to use when dealing with date strings from the wild. Solumin posted:Can I ask why you're converting a datetime object to a string and back? My use case for this is I get them from a bunch of formats and then I use those dates in one format to do things like send emails via boto3/Amazon SES that have the date included. CarForumPoster fucked around with this message at 16:16 on Aug 4, 2019 |
# ? Aug 4, 2019 16:14 |
|
I was thinking it would be easier to keep the date object around and only convert it to a string when needed. But obviously you don't have a choice when it's user input.
|
# ? Aug 4, 2019 18:16 |
|
Doesn't pandas provide string formatting for its datetime objects? Could that be any more consistent / usable in this case?
|
# ? Aug 4, 2019 20:11 |
I typically use arrow instead of datetime.
|
|
# ? Aug 5, 2019 14:29 |
|
What's the correct way to keep track of a list of files on disk in a db, with irregular user-chosen names? I'm pulling attachments from emails and turning them into rows in a table. There's a string column that could hold a path. Ideally I wouldn't change the schema.
|
# ? Aug 5, 2019 16:37 |
|
unpacked robinhood posted:What's the correct way to keep track of a list of files on disk in a db, with irregular user chosen names ? Just stick the path there, relative to some chosen user-media-directory, and make sure that directory is locked down. You can also store the content type and size to avoid looking it up at runtime, but it's not necessary.
|
# ? Aug 5, 2019 17:47 |
Slightly shittier version would be to store JSON in the varchar, containing both the root dir and the relative path, if locking down the root is not feasible.
|
|
# ? Aug 5, 2019 18:06 |
|
unpacked robinhood posted:What's the correct way to keep track of a list of files on disk in a db, with irregular user-chosen names? I have this use case, and the thing that worked for me is sending the file to S3 or SharePoint and just storing the URL and whatever content I want in that row. EDIT: I should mention, we also replaced the filenames with UUIDs when storing because we immediately ran into collisions with filenames. CarForumPoster fucked around with this message at 02:04 on Aug 6, 2019 |
# ? Aug 6, 2019 00:40 |
|
xtal posted:Just stick the path there, relative to some chosen user-media-directory, and make sure that directory is locked down. You can also store the content type and size to avoid looking it up at runtime, but it's not necessary. CarForumPoster posted:EDIT: I should mention, we also replaced the filenames with UUIDs when storing because we immediately ran into collisions with filenames. Thanks. Do you simply call uuidN() to generate a value and use it as a filename? e: Thanks everyone unpacked robinhood fucked around with this message at 13:33 on Aug 9, 2019 |
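Roughly, yes; a common pattern is to keep the extension and replace the rest with a random UUID (the function name here is made up):

```python
import uuid
from pathlib import Path

def storage_name(original: str) -> str:
    # uuid4() is random, so two users uploading "report.pdf" can no
    # longer collide; keep the original name in its own DB column.
    return f"{uuid.uuid4().hex}{Path(original).suffix}"

print(storage_name("report.pdf"))  # e.g. 3f2ab7...d9c1.pdf
```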
# ? Aug 6, 2019 09:31 |