|
unpacked robinhood posted:I have a small percentage of files out of a batch that don't parse well when opened with the default open(..) No, that’s exactly what it’s for.
|
# ? Jul 24, 2019 13:24 |
|
Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?)
|
# ? Jul 24, 2019 21:21 |
|
the yeti posted:Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?) I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented. Is there a reason you can't call out to a binary for it?
|
# ? Jul 25, 2019 02:11 |
|
OnceIWasAnOstrich posted:I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented. Oh yeah, I’m almost certainly gonna use 7zip with command-line options; I was asking more for my own education
|
# ? Jul 25, 2019 02:54 |
|
Are there any websites that let me upload locally built packages (or upload via CI/CD) so that I can pip install them ~from the cloud~ ideally with a password and as cheaply as possible?
|
# ? Jul 25, 2019 13:47 |
|
Boris Galerkin posted:Are there any websites that let me upload locally built packages (or upload via CI/CD) so that I can pip install them ~from the cloud~ ideally with a password and as cheaply as possible? This seems to be an official solution to do it yourself: https://pypi.org/project/pypiserver/ Toss it on a spare PC, Raspberry Pi, VM, VPS, whatever and set up your clients to look there.
|
# ? Jul 25, 2019 17:07 |
|
Found a gotcha with dataclasses: I've broken my model into objects based on very small dataclasses (key/value, value/date, etc. type objects), and these are frozen so I can use them as keys for other dataclass objects. Anyway, you can probably guess what happens when you do this and then call asdict() on it: TypeError: unhashable type: 'dict' A frozen dataclass with 'compare=True' set on a field makes a great dictionary key and allowed me to flatten my previously very nested data structure and remove some redundant data, but I'll have to roll my own dataclass-to-JSON now.
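A minimal reproduction of the gotcha described above (the class names here are made up for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Key:
    name: str
    value: int

@dataclass
class Model:
    table: dict

m = Model(table={Key("a", 1): "first"})
# asdict() recurses into containers and converts the frozen Key *keys*
# into plain dicts as well -- and a dict can't be a dictionary key.
try:
    asdict(m)
except TypeError as exc:
    print(exc)  # unhashable type: 'dict'
```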
|
# ? Jul 25, 2019 19:33 |
|
Hey bros, have another packaging Q. I feel like the only way to ensure a conflict-free dependency graph is to allow multiple versions of a dep, should there be a conflict. (eg dep A requires version 2.0 exactly of subdep C, and dep B requires version 2.1) From what I understand, you can only do this if you rename one (its folder name, and in the imports). Can y'all think of a clever way around this? Maybe retroactively dive into the directories, and rename both the folders and the imports if the need arises?? I think this should be doable, and transparent, since it's only an issue for deps at least one nest-level below what the user's doing. I need to look into how Pip/Pipenv/Poetry do it, but I think Pipenv refuses to continue if there's a conflict. Would take some clever file/folder wrangling, parsing files for import statements, etc. Might be a PITA. Can y'all think of a better way? Dominoes fucked around with this message at 14:38 on Jul 26, 2019 |
# ? Jul 26, 2019 14:36 |
|
Goddamn python, people have been dealing with this forever and you still haven't fixed your poo poo. The folder-name mangling is the easy part. The hard part is automatically modifying imports to account for the new package name. There's a ton of different ways to import a module (many of them dynamic), and reliably accounting for them all is a lot of work that many people have started and given up on. Probably the very easiest wrinkle to handle: Python code:
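The code block in the original post was lost; as a hypothetical stand-in, even the simplest dynamic import illustrates the problem, because the module name is just a runtime string that a source-level renamer will never see:

```python
import importlib

# The name is computed at runtime -- it could just as well come from a
# config file -- so a tool that only rewrites `import foo` statements
# in source files cannot know it needs renaming.
name = "js" + "on"
mod = importlib.import_module(name)
print(mod.dumps({"ok": True}))  # {"ok": true}
```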
|
# ? Jul 26, 2019 21:29 |
|
Does anyone have a pointer to some reasonably thorough benchmarks for evaluating performance when running typical data science/ML tasks? I'm currently looking at https://github.com/numpy/numpy/tree/master/benchmarks, but maybe one of you ML expert types has something even better. The source of the request is that I'm working on some functional tests for our ML people so that we can establish a known baseline for the performance of their Jupyter notebooks, then start switching things out and quantifying the performance difference (OpenBLAS vs MKL, that sort of stuff). Turns out none of them have an existing set of comprehensive benchmarks for this purpose, and I'm doing a little research/questioning before we start writing our own or packaging up those Numpy benchmarks for our testing.
|
# ? Jul 27, 2019 15:41 |
It’s excruciatingly difficult to define a typical data science or machine learning task clearly enough to enable construction of generalised performance benchmarks, so I feel you’ll be best served by writing your own.
|
|
# ? Jul 27, 2019 16:01 |
|
chutwig posted:Does anyone have a pointer to some reasonably thorough benchmarks for evaluating performance when running typical data science/ML tasks? I'm currently looking at https://github.com/numpy/numpy/tree/master/benchmarks, but maybe one of you ML expert types has something even better. https://mlperf.org/
|
# ? Jul 27, 2019 16:08 |
|
I have a weird python question. I've been having weird internet outages, so I whipped up a little program that pings a given URL once every second, then logs any errors with a timestamp to a file. For the purposes of having a tiny visual to go along with it, I have it print a '.' for successes and a '!' for failures to the console. This works fine in, say, the VSCode console, but it doesn't update at all when I run it in the Windows command line or PowerShell through python. When I had entire sentences for success/failure, it worked fine. If I make it log a newline after every character, it also works fine (but does seem to only update every two characters?). Googling led me to some weird regedit discussions on Stack Overflow, but it didn't result in anything. For example: Not working: code:
code:
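The two snippets didn't survive in the post; the difference was presumably just the newline, something like this sketch (the success/failure values are hardcoded stand-ins for the real ping check):

```python
import time

def status_char(ok: bool) -> str:
    return "." if ok else "!"

# Not working in cmd.exe/PowerShell: end="" suppresses the newline,
# so the characters sit in stdout's buffer instead of appearing.
for ok in (True, True, False, True):
    print(status_char(ok), end="")
    time.sleep(0.01)
print()

# Working: print's default newline flushes line-buffered stdout each call.
for ok in (True, True, False, True):
    print(status_char(ok))
```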
|
# ? Jul 28, 2019 07:24 |
|
Falcon2001 posted:I have a weird python question. Does it work if you use other characters, such as 'a' and 'b'?
|
# ? Jul 28, 2019 07:45 |
|
QuarkJets posted:Does it work if you use other characters, such as 'a' and 'b'? No, doesn't change it.
|
# ? Jul 28, 2019 08:04 |
I’d recommend hacking tqdm like so for this use case.
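The snippet itself was lost in posting; a guess at the kind of hack meant, using tqdm's public API to get a live counter rather than raw characters (the failure pattern is a stand-in for the real ping check):

```python
from tqdm import tqdm

# One tick per ping attempt; the postfix keeps a running failure count,
# and tqdm handles its own flushing, sidestepping the console issue.
failures = 0
with tqdm(total=10, desc="pinging", unit="ping") as bar:
    for i in range(10):
        ok = i % 4 != 0  # stand-in for the real ping result
        if not ok:
            failures += 1
        bar.set_postfix(failures=failures)
        bar.update(1)
```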
|
|
# ? Jul 28, 2019 08:26 |
|
cinci zoo sniper posted:I’d recommend hacking tqdm like so for this use case. Hey cool, I'll check it out! For the main problem, mostly I'm just wondering if there's some weird python rule I'm not aware of (or alternately, weird console interaction I'm not aware of).
|
# ? Jul 28, 2019 08:28 |
Probably just a difference in display modes between the consoles - it’s so inconsistent in my experience that I’ve just given up and slam tqdm everywhere I can.
|
|
# ? Jul 28, 2019 08:53 |
|
You're writing to stdout, and it's line-buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline. Use print(whatever, flush=True)
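Applied to the ping logger from earlier, the fix is one keyword argument:

```python
def report(ok: bool) -> None:
    # flush=True pushes the single character out immediately instead of
    # waiting for a newline or for the buffer to fill.
    print("." if ok else "!", end="", flush=True)

report(True)
report(False)
print()
```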
|
# ? Jul 28, 2019 08:57 |
|
Yeah, anytime you have weird issues with stuff not updating in the console like you think it should, start looking for the flush options.
|
# ? Jul 28, 2019 17:29 |
|
Thermopyle posted:Yeah, anytime you have weird issues with stuff not updating in the console like you think it should start looking for the flush options. Foxfire_ posted:You're writing to stdout and it's line buffered by default. It won't flush to the OS unless a buffer fills up or you write a newline. Ah, thanks! That totally fixes it.
|
# ? Jul 28, 2019 18:15 |
|
If you have a helpful link as an answer, that'd be great; don't feel like you need to give me a detailed, personalized answer. My Problem: I'm starting to have more and more python/selenium-based web scrapers that get very similar data from different gov websites. For example: I scrape the same data using 6 templates from 6 different government websites and format all the data to go in the same Google Sheet. Right now, I can run all of these scrapers in about 10-15 minutes locally. We're considering upping this to 50 government websites/day, which goes a bit beyond what I can run sequentially on my local machine. Instead I'd like to run them simultaneously and automatically, simply making a log so I can catch any errors. One of them will run into errors about 1 in every 4 or 5 days, but with 5X the websites it'll likely happen daily for a while. TLDR: What's the current industry standard to scrape 50+ websites per day using separate processes/workers? Is it time to figure out AWS Lambda? What's the industry-standard way to log this? CarForumPoster fucked around with this message at 21:08 on Jul 28, 2019 |
# ? Jul 28, 2019 21:06 |
|
I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like. If I was looking to make something robust, but on a budget, I'd run an instance of Redis and then use python-rq. Easy to set up, and easily distributed across as many machines as you'd like. python-rq's documentation is easy to follow.
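A rough sketch of that setup (the job-function name and URLs are made up; the enqueue half assumes `pip install rq redis` and a reachable Redis server):

```python
def scrape_site(url: str) -> str:
    # Stand-in for the real selenium scraper.
    return f"scraped {url}"

def enqueue_all(urls):
    # Workers started on any machine with `rq worker`, pointed at the
    # same Redis instance, pull these jobs off the queue; results and
    # failures are inspectable via rq's job registries.
    from redis import Redis
    from rq import Queue

    q = Queue(connection=Redis())
    for url in urls:
        q.enqueue(scrape_site, url)
```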
|
# ? Jul 28, 2019 21:37 |
Thermopyle posted:I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like. This would be how I'd approach it as well. PyPy may also give some speed improvements, but I've never used it with selenium. With python-rq you could run x number of workers on y number of machines being fed sites to scrape, all putting the results wherever you like.
|
|
# ? Jul 29, 2019 00:02 |
|
Thermopyle posted:I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like. I have what I feel like is a stupid question...should I just make a function for each template and then run it on AWS Lambda? Any real-world experience as to why I'd do that versus trying to figure out python-rq?
|
# ? Jul 29, 2019 01:47 |
|
I've done both and both are pretty easy. I don't really think there's any more "figuring out" with either of them. Lambda costs money, of course. Doing it with Redis may or may not, depending on your current infrastructure. It's not terribly uncommon to find websites blocking the IP ranges of Amazon, Digital Ocean, Microsoft, etc., to prevent automated stuff like this.
|
# ? Jul 29, 2019 15:57 |
|
Thermopyle posted:I've done both and both are pretty easy. Thanks for the encouragement. That’s a good point regarding the IP ranges. I have a hilarious amount of AWS credits, more than I can use before they expire. My current infrastructure is nothing; literally all our data is on S3 or SharePoint. There’s one switch in our office for the Ethernet jacks in the walls.
|
# ? Jul 29, 2019 17:09 |
|
Falcon2001 posted:Ah, thanks! That totally fixes it.
|
# ? Jul 29, 2019 17:16 |
I'm having some trouble working with datetime in a small app I'm creating. I'm able to convert a datetime object to a string using strftime, but when I try to convert the exact same string back into a datetime using strptime in order to search a database for it, I'm told that the formatting I'm using is not valid even though I literally copy/pasted the format from the strftime function. The functions are: new_ev.date_time.strftime("%-m/%d/%y %-I:%M %p") datetime.strptime(date_time, "%-m/%d/%y %-I:%M %p") The error I'm getting when the bottom one is called is "ValueError: '-' is a bad directive in format '%-m/%d/%y %-I:%M %p'" Can anyone point out to me why I'm having this problem?
|
|
# ? Aug 4, 2019 01:21 |
|
Support for '%-' is platform-specific: strftime calls out to your system's C library, so your platform's strftime accepts `%-I`, but Python's strptime only accepts the directives documented in the standard library, which don't include the '%-' flag. Can I ask why you're converting a datetime object to a string and back? Also, you may find this useful: http://strftime.org/
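Dropping the '-' flag makes the round trip work everywhere, at the cost of zero-padded months and hours:

```python
from datetime import datetime

FMT = "%m/%d/%y %I:%M %p"  # portable: no platform-only %- flags

dt = datetime(2001, 1, 1, 0, 0)
s = dt.strftime(FMT)
print(s)  # 01/01/01 12:00 AM
# Unlike the %- version, this parses back without a ValueError.
assert datetime.strptime(s, FMT) == dt
```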
|
# ? Aug 4, 2019 01:33 |
Solumin posted:Can I ask why you're converting a datetime object to a string and back? By "exact same string" I didn't mean the same object, just the same value (in this case since I was testing it "1/1/2001 12:00 AM"). Events are pulled from the db file to be displayed in a readable format on a table, so I use strftime to do that. It's whenever an event is created or modified that strptime is used to turn the user's input into a datetime object and verify that there's no other events at the same time.
|
|
# ? Aug 4, 2019 16:13 |
|
i vomit kittens posted:I'm having some trouble working with datetime in a small app I'm creating. I'm able to convert a datetime object to a string using strftime, but when I try to convert the exact same string back into a datetime using strptime in order to search a database for it, I'm told that the formatting I'm using is not valid even though I literally copy/pasted the format from the strftime function. Poster above me hit it. I deal with a lot of date strings gotten from a variety of web scraping and API calls. Working with datetime is a bitch in that use case. I've come around to prefer pendulum. Best part is much of the syntax is the same; it's just a little easier to use when dealing with date strings from the wild. Solumin posted:Can I ask why you're converting a datetime object to a string and back? My use case for this is I get them from a bunch of formats and then I use those dates in one format to do things like send emails via boto3/Amazon SES that have the date included. CarForumPoster fucked around with this message at 16:16 on Aug 4, 2019 |
# ? Aug 4, 2019 16:14 |
|
I was thinking it would be easier to keep the date object around and only convert it to a string when needed. But obviously you don't have a choice when it's user input.
|
# ? Aug 4, 2019 18:16 |
|
Doesn't pandas provide string formatting for its datetime objects? Could that be any more consistent / usable in this case?
|
# ? Aug 4, 2019 20:11 |
I typically use arrow instead of datetime.
|
|
# ? Aug 5, 2019 14:29 |
|
What's the correct way to keep track of a list of files on disk in a db, with irregular user-chosen names? I'm pulling attachments from emails and turning them into rows in a table. There's a string column that could hold a path. Ideally I wouldn't change the schema.
|
# ? Aug 5, 2019 16:37 |
|
unpacked robinhood posted:What's the correct way to keep track of a list of files on disk in a db, with irregular user chosen names ? Just stick the path there, relative to some chosen user-media-directory, and make sure that directory is locked down. You can also store the content type and size to avoid looking it up at runtime, but it's not necessary.
|
# ? Aug 5, 2019 17:47 |
Slightly shittier version would be to store JSON in the varchar, containing both the root dir and the relative path, if locking down the root is not feasible.
|
|
# ? Aug 5, 2019 18:06 |
|
unpacked robinhood posted:What's the correct way to keep track of a list of files on disk in a db, with irregular user-chosen names? I have this use case, and the thing that worked for me is sending the file to S3 or SharePoint and just storing the URL and whatever content I want in that row. EDIT: I should mention, we also replaced the filenames with UUIDs when storing because we immediately ran into collisions with filenames. CarForumPoster fucked around with this message at 02:04 on Aug 6, 2019 |
# ? Aug 6, 2019 00:40 |
|
xtal posted:Just stick the path there, relative to some chosen user-media-directory, and make sure that directory is locked down. You can also store the content type and size to avoid looking it up at runtime, but it's not necessary. CarForumPoster posted:EDIT: I should mention, we also replaced the filenames with UUIDs when storing because we immediately ran into collisions with filenames. Thanks. Do you simply call uuidN() to generate a value and use it as a filename? e: Thanks everyone unpacked robinhood fucked around with this message at 13:33 on Aug 9, 2019 |
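Roughly, yes; a common pattern is to keep the extension and replace the rest with a random UUID (the function name here is made up):

```python
import uuid
from pathlib import Path

def storage_name(original: str) -> str:
    # uuid4() is random, so two users uploading "report.pdf" can no
    # longer collide; keep the original name in its own DB column.
    return f"{uuid.uuid4().hex}{Path(original).suffix}"

print(storage_name("report.pdf"))  # e.g. 3f2ab7...d9c1.pdf
```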
# ? Aug 6, 2019 09:31 |