Phobeste
Apr 9, 2006

never, like, count out Touchdown Tom, man

unpacked robinhood posted:

I have a small percentage of files out of a batch that don't parse well when opened with the default open(..)
I've managed to get around this by checking the encoding on each file with filemagic, and setting the encoding parameter accordingly.

Does it feel bad-practicy?

No, that’s exactly what the encoding parameter is for.
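If you ever want to drop the filemagic dependency, a dumb stdlib-only fallback loop does the same job — the candidate list here is just an example, tune it to your batch:

```python
def read_text_flexibly(path, candidates=("utf-8", "utf-16", "latin-1")):
    """Try each candidate encoding in turn; return the first clean decode.

    Note: latin-1 decodes any byte sequence, so putting it last makes it
    a catch-all rather than a real detection step.
    """
    for enc in candidates:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"none of {candidates} decoded {path}")
```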


the yeti
Mar 29, 2008

memento disco



Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?)

OnceIWasAnOstrich
Jul 22, 2006

the yeti posted:

Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?)

I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented.

Is there a reason you can't call out to a binary for it?

the yeti
Mar 29, 2008

memento disco



OnceIWasAnOstrich posted:

I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented.

Is there a reason you can't call out to a binary for it?

Oh yeah I’m almost certainly gonna use 7zip with command line options, I was asking more for my further education :)
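For the thread's benefit, the shell-out really is about this much code — the `-p` and `-mem=AES256` switches are 7-Zip's password and AES-encryption options for .zip output, and the paths here are made up:

```python
import subprocess

def build_7z_command(archive, files, password):
    # "a" = add to archive; -p<pwd> sets the password,
    # -mem=AES256 picks AES-256 instead of legacy ZipCrypto for .zip output
    return ["7z", "a", f"-p{password}", "-mem=AES256", archive, *files]

def create_encrypted_zip(archive, files, password):
    # requires the 7z binary to be on PATH
    subprocess.run(build_7z_command(archive, files, password), check=True)
```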

Boris Galerkin
Dec 17, 2011

I don't understand why I can't harass people online. Seriously, somebody please explain why I shouldn't be allowed to stalk others on social media!

Are there any websites that let me upload locally built packages (or upload via CI/CD) so that I can pip install them ~from the cloud~ ideally with a password and as cheaply as possible?

wolrah
May 8, 2006
what?

Boris Galerkin posted:

Are there any websites that let me upload locally built packages (or upload via CI/CD) so that I can pip install them ~from the cloud~ ideally with a password and as cheaply as possible?

This seems to be an official solution to do it yourself: https://pypi.org/project/pypiserver/

Toss it on a spare PC, Raspberry Pi, VM, VPS, whatever and set up your clients to look there.
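The quickstart looks roughly like this — package name is real, but the paths, port, and username are placeholders, and check pypiserver's README for the exact auth flags:

```shell
pip install pypiserver passlib
mkdir ~/packages                      # drop your built wheels/sdists here
htpasswd -sc htpasswd.txt yourname    # password file (needs apache2-utils)
pypi-server -p 8080 -P htpasswd.txt ~/packages

# then on clients:
pip install --extra-index-url http://yourserver:8080/simple/ yourpackage
```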

mr_package
Jun 13, 2000
Found a gotcha with dataclasses: I've broken my model into objects based on very small dataclasses (key / value, value / date, etc. type objects) and these are frozen so I can use them as keys for other dataclass objects. Anyway you can probably guess what happens when you do this and then call asdict() on it: TypeError: unhashable type: 'dict'

A frozen dataclass with a 'compare=True' set on a field makes a great dictionary key and allowed me to flatten my previously very nested data structure and remove some redundant data but I'll have to roll my own dataclass-to-json now.
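For anyone else who hits this: asdict() recurses into dict *keys* too, so a frozen-dataclass key becomes an (unhashable) dict mid-conversion. A minimal repro plus one possible hand-rolled serializer — the names here are made up:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Key:
    name: str
    version: int

@dataclass
class Registry:
    entries: dict  # maps Key -> str

reg = Registry(entries={Key("alpha", 2): "first"})

# asdict() tries to turn each Key *key* into a dict, which can't hash:
#   TypeError: unhashable type: 'dict'

def registry_to_json(reg):
    # flatten each frozen key to a plain string before serializing
    return json.dumps(
        {f"{k.name}@{k.version}": v for k, v in reg.entries.items()}
    )
```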

Dominoes
Sep 20, 2007

Hey bros, have another packaging Q. I feel like the only way to ensure a conflict-free dependency graph is to allow multiple versions of a dep, should there be a conflict (e.g. dep A requires version 2.0 exactly of subdep C, and dep B requires version 2.1). From what I understand, you can only do this if you rename one (its folder name, and in the imports). Can y'all think of a clever way around this?

Maybe retroactively dive into the directories, and rename both the folders and the imports if the need arises?? I think this should be doable, and transparent, since it's only an issue for deps at least one nest-level below what the user's doing. I need to look into how Pip/Pipenv/Poetry do it, but I think Pipenv refuses to continue if there's a conflict. Would take some clever file/folder wrangling, parsing files for import statements etc. Might be a PITA.

Can y'all think of a better way?

Dominoes fucked around with this message at 14:38 on Jul 26, 2019

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Goddamn python, people have been dealing with this forever and you still haven't fixed your poo poo.

The folder name mangling is the easy part. The hard part is automatically modifying imports to account for the new package name. There's a ton of different ways to import a module (many of them dynamic) and reliably accounting for them all is a lot of work that many people have started and given up on.

Probably the very easiest wrinkle to handle:

Python code:
import importlib
module = "requests"
importlib.import_module(module)
It's pretty common to do a lot of string manipulation to build import names. Django, for example, does a lot of dynamic string manipulation to build import paths for all the INSTALLED_APPS.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Does anyone have a pointer to some reasonably thorough benchmarks for evaluating performance when running typical data science/ML tasks? I'm currently looking at https://github.com/numpy/numpy/tree/master/benchmarks, but maybe one of you ML expert types has something even better.

The source of the request is that I'm working on some functional tests for our ML people so that we can establish a known baseline for the performance of their Jupyter notebooks, then start switching things out and quantifying the performance difference (OpenBLAS vs MKL, that sort of stuff). Turns out none of them have an existing set of comprehensive benchmarks for this purpose, and I'm doing a little research/questioning before we start writing our own or packaging up those Numpy benchmarks for our testing.

cinci zoo sniper
Mar 15, 2013




It’s excruciatingly difficult to define a typical data science or machine learning task clearly enough to enable construction of generalised performance benchmarks, so I feel you’ll be best suited by writing your own.

pmchem
Jan 22, 2010


chutwig posted:

Does anyone have a pointer to some reasonably thorough benchmarks for evaluating performance when running typical data science/ML tasks? I'm currently looking at https://github.com/numpy/numpy/tree/master/benchmarks, but maybe one of you ML expert types has something even better.

The source of the request is that I'm working on some functional tests for our ML people so that we can establish a known baseline for the performance of their Jupyter notebooks, then start switching things out and quantifying the performance difference (OpenBLAS vs MKL, that sort of stuff). Turns out none of them have an existing set of comprehensive benchmarks for this purpose, and I'm doing a little research/questioning before we start writing our own or packaging up those Numpy benchmarks for our testing.

https://mlperf.org/

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I have a weird python question.

I've been having weird internet outages, so I whipped up a little program that pings a given URL once every second, then logs any errors with a timestamp to a file. For the purposes of having a tiny visual to go along with it, I have it print a '.' for successes, and a '!' for failures to the console. This works fine in, say, the VSCode console, but it doesn't update at all when I run it in the windows commandline or powershell through python.

When I had entire sentences for success/failure, it worked fine. If I make it so it logs a newline after every character, it also works fine (but does seem to only update every two characters?). Googling led me to some weird regedit discussions on Stack Overflow, but it didn't result in anything.

For examples:

Not working:
code:
while True:
    time.sleep(1)
    if ping(target) == None:
        logging.warning('Ping to '+target+' failed!')
        print('!', end="")
    else:
        print('.', end="")
Working:

code:
while True:
    time.sleep(1)
    if ping(target) == None:
        logging.warning('Ping to '+target+' failed!')
        print('!')
    else:
        print('.')
Any ideas?

QuarkJets
Sep 8, 2008

Falcon2001 posted:

I have a weird python question.

I've been having weird internet outages, so I whipped up a little program that pings a given URL once every second, then logs any errors with a timestamp to a file. For the purposes of having a tiny visual to go along with it, I have it print a '.' for successes, and a '!' for failures to the console. This works fine in, say, the VSCode console, but it doesn't update at all when I run it in the windows commandline or powershell through python.

When I had entire sentences for success/failure, it worked fine. If I make it so it logs a newline after every character, it also works fine (but does seem to only update every two characters?). Googling led me to some weird regedit discussions on Stack Overflow, but it didn't result in anything.

For examples:

Not working:
code:
while True:
    time.sleep(1)
    if ping(target) == None:
        logging.warning('Ping to '+target+' failed!')
        print('!', end="")
    else:
        print('.', end="")
Working:

code:
while True:
    time.sleep(1)
    if ping(target) == None:
        logging.warning('Ping to '+target+' failed!')
        print('!')
    else:
        print('.')
Any ideas?

Does it work if you use other characters, such as 'a' and 'b'?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

Does it work if you use other characters, such as 'a' and 'b'?

No, doesn't change it.

cinci zoo sniper
Mar 15, 2013




I’d recommend hacking tqdm like so for this use case.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

cinci zoo sniper posted:

I’d recommend hacking tqdm like so for this use case.

Hey cool, I'll check it out!

For the main problem, mostly I'm just wondering if there's some weird python rule I'm not aware of (or alternately, weird console interaction I'm not aware of).

cinci zoo sniper
Mar 15, 2013




Probably just a difference in display modes between the consoles - it’s so inconsistent in my experience that I’ve just given up and slam tqdm everywhere I can.

Foxfire_
Nov 8, 2010

You're writing to stdout and it's line buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline.

Use print(whatever, flush=True)
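e.g. for the dot-printer loop:

```python
# one-character progress marks sit in the stdout buffer until a newline
# shows up; flush=True pushes each write straight through to the OS
for ok in (True, False, True):
    print("." if ok else "!", end="", flush=True)
print()  # final newline when the loop ends
```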

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Yeah, anytime you have weird issues with stuff not updating in the console like you think it should, start looking for the flush options.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Thermopyle posted:

Yeah, anytime you have weird issues with stuff not updating in the console like you think it should, start looking for the flush options.

Foxfire_ posted:

You're writing to stdout and it's line buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline.

use print(whatever, flush=True)

Ah, thanks! That totally fixes it.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
If you have a helpful link as an answer, that'd be great; don't feel like you need to give me a detailed, personalized answer.

My Problem: I'm starting to have more and more python/selenium based web scrapers that get very similar data from different gov websites.

For example: I scrape the same data using 6 templates from 6 different government websites and format all the data to go in the same Google sheet. Right now, I can run all of these scrapers in about 10-15 minutes locally. We're considering upping this to 50 government websites/day, which goes a bit beyond what I can run sequentially on my local machine. Instead I'd like to run them simultaneously and automatically, simply making a log so I can catch any errors. One of them will run into errors about 1 in every 4 or 5 days, but with 5x the websites it'll likely happen daily for a while.

TLDR: What's the current industry standard to scrape 50+ websites per day using separate processes/workers? Is it time to figure out AWS Lambda? What's the industry standard way to log this?

CarForumPoster fucked around with this message at 21:08 on Jul 28, 2019

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like.

If I was looking to make something robust, but on a budget, I'd run an instance of Redis and then use python-rq.

Easy to set up, and easily distributed across as many machines as you'd like.

python-rq's documentation is easy to follow.
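The shape of it, roughly — the job body and Redis URL are placeholders, and the wiring needs `pip install rq redis` plus a reachable Redis:

```python
def scrape_site(url: str) -> str:
    """Placeholder job: a real version would drive selenium here."""
    return f"scraped {url}"

def enqueue_all(urls, redis_url="redis://localhost:6379/0"):
    # run `rq worker scrapers` on as many machines as you like to drain
    # the queue; results/failures land in Redis for later inspection
    from redis import Redis
    from rq import Queue
    q = Queue("scrapers", connection=Redis.from_url(redis_url))
    return [q.enqueue(scrape_site, u) for u in urls]
```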

NinpoEspiritoSanto
Oct 22, 2013




Thermopyle posted:

I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like.

If I was looking to make something robust, but on a budget, I'd run an instance of Redis and then use python-rq.

Easy to set up, and easily distributed across as many machines as you'd like.

python-rq's documentation is easy to follow.

This would be how I'd approach it as well. Pypy may also give some speed improvements but I've never used it with selenium. With python-rq you could run x number of workers on y number of machines being fed sites to scrape all putting the results wherever you like.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Thermopyle posted:

I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like.

If I was looking to make something robust, but on a budget, I'd run an instance of Redis and then use python-rq.

Easy to set up, and easily distributed across as many machines as you'd like.

python-rq's documentation is easy to follow.

I have what I feel like is a stupid question...should I just make a function for each template then run it on AWS Lambda? Any real world experience as to why I'd do that versus trying to figure out python-rq?

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I've done both and both are pretty easy.

I don't really think there's any more "figuring out" with either of them.

Lambda costs money, of course. Doing it with Redis may or may not, depending on your current infrastructure.

It's not terribly uncommon to find web sites blocking the IP ranges of Amazon, Digital Ocean, Microsoft, etc, to prevent automated stuff like this.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Thermopyle posted:

I've done both and both are pretty easy.

I don't really think there's any more "figuring out" with either of them.

Lambda costs money, of course. Doing it with Redis may or may not, depending on your current infrastructure.

It's not terribly uncommon to find web sites blocking the IP ranges of Amazon, Digital Ocean, Microsoft, etc, to prevent automated stuff like this.

Thanks for the encouragement. That’s a good point regarding the ip ranges.

I have a hilarious amount of AWS credits, more than I can use before they expire. My current infrastructure is nothing; literally all our data is on S3 or SharePoint. There’s one switch in our office for the Ethernet jacks in the walls.

IAmKale
Jun 7, 2007

やらないか

Fun Shoe

Falcon2001 posted:

Ah, thanks! That totally fixes it.

You can also try setting PYTHONUNBUFFERED=1 as an environment variable when you execute your program. flush=True probably accomplishes the same thing, but with the environment variable you won't have to sprinkle the latter around your codebase.

i vomit kittens
Apr 25, 2019


I'm having some trouble working with datetime in a small app I'm creating. I'm able to convert a datetime object to a string using strftime, but when I try to convert the exact same string back into a datetime using strptime in order to search a database for it, I'm told that the formatting I'm using is not valid even though I literally copy/pasted the format from the strftime function.

The functions are:

new_ev.date_time.strftime("%-m/%d/%y %-I:%M %p")

datetime.strptime(date_time, "%-m/%d/%y %-I:%M %p")

The error I'm getting when the bottom one is called is "ValueError: '-' is a bad directive in format '%-m/%d/%y %-I:%M %p'"

Can anyone point out to me why I'm having this problem?

Solumin
Jan 11, 2013
Support for '%-' is platform-specific: strftime calls out to the system's C library, and your platform's version happens to support `%-I`. Python's strptime, on the other hand, is implemented in Python and only supports the standard directives, so the '-' modifier is rejected.

Can I ask why you're converting a datetime object to a string and back?

Also, you may find this useful: http://strftime.org/
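Concretely: drop the `-` and the round trip works, and strptime is lenient about zero-padding on input anyway, so the same format string handles user-typed dates like "1/1/01":

```python
from datetime import datetime

FMT = "%m/%d/%y %I:%M %p"  # zero-padded standard directives only: portable

dt = datetime(2001, 1, 1, 0, 0)
s = dt.strftime(FMT)               # '01/01/01 12:00 AM' in an English locale
back = datetime.strptime(s, FMT)   # round-trips cleanly

# strptime doesn't demand the zero padding on input, so user-typed
# '1/1/01 12:00 AM' parses with the very same format string:
typed = datetime.strptime("1/1/01 12:00 AM", FMT)
```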

i vomit kittens
Apr 25, 2019


Solumin posted:

Can I ask why you're converting a datetime object to a string and back?

By "exact same string" I didn't mean the same object, just the same value (in this case since I was testing it "1/1/2001 12:00 AM"). Events are pulled from the db file to be displayed in a readable format on a table, so I use strftime to do that. It's whenever an event is created or modified that strptime is used to turn the user's input into a datetime object and verify that there's no other events at the same time.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

i vomit kittens posted:

I'm having some trouble working with datetime in a small app I'm creating. I'm able to convert a datetime object to a string using strftime, but when I try to convert the exact same string back into a datetime using strptime in order to search a database for it, I'm told that the formatting I'm using is not valid even though I literally copy/pasted the format from the strftime function.

The functions are:

new_ev.date_time.strftime("%-m/%d/%y %-I:%M %p")

datetime.strptime(date_time, "%-m/%d/%y %-I:%M %p")

The error I'm getting when the bottom one is called is "ValueError: '-' is a bad directive in format '%-m/%d/%y %-I:%M %p'"

Can anyone point out to me why I'm having this problem?

Poster above me hit it. I deal with a lot of date strings gotten from a variety of web scraping and API calls. Working with datetime is a bitch in that use case. I've come around to prefer pendulum. Best part is much of the syntax is the same, it's just a little easier to use when dealing with date strings from the wild.

Solumin posted:

Can I ask why you're converting a datetime object to a string and back?

My use case for this is I get them from a bunch of formats and then I use those dates in one format to do things like send emails via boto3/Amazon SES that have the date included.

CarForumPoster fucked around with this message at 16:16 on Aug 4, 2019

Solumin
Jan 11, 2013
I was thinking it would be easier to keep the date object around and only convert it to a string when needed. But obviously you don't have a choice when it's user input.

QuarkJets
Sep 8, 2008

Doesn't pandas provide string formatting for its datetime objects? Could that be any more consistent / usable in this case?

a foolish pianist
May 6, 2007

(bi)cyclic mutation

I typically use arrow instead of datetime.

unpacked robinhood
Feb 18, 2013

by Fluffdaddy
What's the correct way to keep track of a list of files on disk in a db, with irregular user-chosen names?
I'm pulling attachments from emails and turning them into rows in a table.
There's a string column that could hold a path. Ideally I wouldn't change the schema.

xtal
Jan 9, 2011

by Fluffdaddy

unpacked robinhood posted:

What's the correct way to keep track of a list of files on disk in a db, with irregular user chosen names ?
I'm pulling attachments from emails and turning them into rows in a table.
There's a string column that could hold a path. Ideally I wouldn't change the schema.

Just stick the path there, relative to some chosen user-media-directory, and make sure that directory is locked down. You can also store the content type and size to avoid looking it up at runtime, but it's not necessary.
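A sketch of that, with a guard so a crafted filename can't point outside the media dir — the root path here is hypothetical:

```python
from pathlib import Path

MEDIA_ROOT = Path("/srv/app/media")  # hypothetical locked-down directory

def path_for_db(p: str) -> str:
    """Return the path relative to MEDIA_ROOT, rejecting anything outside it."""
    resolved = Path(p).resolve()
    # relative_to raises ValueError if p is not under MEDIA_ROOT
    # (e.g. a sneaky '../../etc/passwd' in a user-chosen name)
    return str(resolved.relative_to(MEDIA_ROOT.resolve()))
```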

cinci zoo sniper
Mar 15, 2013




A slightly shittier version would be to store JSON in the varchar, containing both the root dir and the relative path, if locking down the root is not feasible.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

unpacked robinhood posted:

What's the correct way to keep track of a list of files on disk in a db, with irregular user chosen names ?
I'm pulling attachments from emails and turning them into rows in a table.
There's a string column that could hold a path. Ideally I wouldn't change the schema.

I have this usecase and the thing that worked for me is sending the file to S3 or SharePoint and just storing the URL and whatever content I want in that row.

EDIT: I should mention, we also replaced the filenames with UUIDs when storing because we immediately ran into collisions with filenames.
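The renaming is basically one line, something like this — uuid4 plus keeping the original extension is just one choice:

```python
import uuid
from pathlib import Path

def collision_free_name(original_name: str) -> str:
    # keep the extension so the content type stays guessable,
    # replace the stem with a random UUID to dodge collisions
    return uuid.uuid4().hex + Path(original_name).suffix
```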

CarForumPoster fucked around with this message at 02:04 on Aug 6, 2019


unpacked robinhood
Feb 18, 2013

by Fluffdaddy

xtal posted:

Just stick the path there, relative to some chosen user-media-directory, and make sure that directory is locked down. You can also store the content type and size to avoid looking it up at runtime, but it's not necessary.

CarForumPoster posted:

EDIT: I should mention, we also replaced the filenames with UUIDs when storing because we immediately ran into collisions with filenames.

Thanks. Do you simply call uuidN() to generate a value and use it as the filename?

e


Thanks everyone

unpacked robinhood fucked around with this message at 13:33 on Aug 9, 2019
