Rocko Bonaparte
Mar 12, 2002

Every day is Friday!
Has anybody here used threading.local? I can (and probably will) do some experiments before getting into it, but I'm trying to understand its caveats and gotchas. I am in a situation where I need to pass in some context to a callback. The system that uses that callback is stateless and doesn't need to worry about thread safety. However, the callback I'm using is contextual so I need different state between threads. I'm trying to figure out how I might manage that state.

My understanding is threading.local() just gives me some handle to shove poo poo on a per-thread basis. I first need to figure out whether threading.local() gives a unique handle each time (one I'd then have to juggle) or the same handle each time for the same thread. If the former, then I guess I have to stash it somehow and resolve it on a per-thread basis, and at that point I don't know if it matters that I use it in the first place, so I was hoping it was the latter. The other issue then is cleaning up afterwards: I guess I need to make sure I delete the stuff so I don't leave it assigned to the thread for eternity.


susan b buffering
Nov 14, 2016

Rocko Bonaparte posted:

Has anybody here used threading.local? I can (and probably will) do some experiments before getting into it, but I'm trying to understand its caveats and gotchas. I am in a situation where I need to pass in some context to a callback. The system that uses that callback is stateless and doesn't need to worry about thread safety. However, the callback I'm using is contextual so I need different state between threads. I'm trying to figure out how I might manage that state.

My understanding is threading.local() just gives me some handle to shove poo poo on a per-thread basis. I first need to figure out whether threading.local() gives a unique handle each time (one I'd then have to juggle) or the same handle each time for the same thread. If the former, then I guess I have to stash it somehow and resolve it on a per-thread basis, and at that point I don't know if it matters that I use it in the first place, so I was hoping it was the latter. The other issue then is cleaning up afterwards: I guess I need to make sure I delete the stuff so I don't leave it assigned to the thread for eternity.

threading.local() creates a new object each time you call it, so assign that object to a variable once and store your per-thread data as attributes on it. The same object will give each thread its own independent set of attributes and will persist as long as you keep a reference to it, but calling threading.local() again just gives you a different, empty object.
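Something like this, for example (a minimal sketch; the names are made up):
code:
import threading

# One shared threading.local() instance; each thread sees only its own attributes on it
_context = threading.local()

def set_context(value):
    # stores value for the calling thread only
    _context.value = value

def callback():
    # reads whatever the *current* thread stored, or None if it never set anything
    return getattr(_context, "value", None)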

Hughmoris
Apr 21, 2007
Let's go to the abyss!
Rookie question:

How do I find/read built-in function implementations in Python? E.g. I have Python 3.9 and I'd like to see how they implemented the max() or any() function. I've been poking around github and Google but haven't had any luck.

Phobeste
Apr 9, 2006

never, like, count out Touchdown Tom, man

Hughmoris posted:

Rookie question:

How do I find/read built-in function implementations in Python? E.g. I have Python 3.9 and I'd like to see how they implemented the max() or any() function. I've been poking around github and Google but haven't had any luck.

Depends on which ones you mean. max, min, and the other builtins and language-provided functions are implemented in C, but the standard library modules are by and large pure Python in the Lib directory of the cpython repo.
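You can see the split from the REPL, too -- something like this (quick sketch): inspect can show you the source of pure-Python stdlib code, but it has nothing to show for the C builtins.
code:
import inspect
import json

# pure-Python stdlib: prints the actual source from Lib/json/__init__.py
print(inspect.getsource(json.dumps))

# C builtin: there is no Python source, so this raises TypeError
try:
    inspect.getsource(max)
except TypeError as err:
    print(err)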

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Phobeste posted:

Depends on which ones you mean. max, min, and the other builtins and language-provided functions are implemented in C, but the standard library modules are by and large pure Python in the Lib directory of the cpython repo.

Thanks!

Data Graham
Dec 28, 2009

📈📊🍪😋



And if you use an IDE like PyCharm etc, you can cmd-click on any function and it'll take you right to its definition.

salisbury shake
Dec 27, 2011
I have a Python script that I'd like to redirect stdin to stdout in, like you can do in the shell. This is how I'm using the script:
code:
$ echo "hi" | python3 my_script.py
hi
$ pv /dev/zero | python3 my_script.py > /dev/null
There's an issue with manually iterating over stdin and writing it to stdout, because Python for-loops are slow:
code:
from sys import stdin, stdout

for line in stdin.buffer:
  stdout.buffer.write(line)
The above is not only slow, it maxes out my CPU at only something like 150MB/s.

Going further by removing the explicit for-loop and using iterators yields roughly the same speeds and CPU utilization:
code:
from sys import stdin, stdout

stdout.buffer.writelines(stdin.buffer)
However, when redirecting standard input using a subprocess, I can hit 3+ GB/s with much less CPU usage:
code:
from sys import stdin, stdout
from subprocess import run

run(
  'cat',
  shell=True,
  stdin=stdin.buffer,
  stdout=stdout.buffer
)
The problem is that the above isn't cross-platform, and it bothers me that I need to shell out to a subprocess to redirect standard input efficiently.

Given that I just want to redirect stdin to stdout, I figure I need to swap their file descriptors using something like os.dup2() instead of using Python to explicitly iterate. Anyone know how I'd go about that?

salisbury shake fucked around with this message at 22:00 on Nov 9, 2020

NinpoEspiritoSanto
Oct 22, 2013




Why not use subprocess.PIPE directly? No need to use the shell then.
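Something like this, roughly (an untested sketch) -- passing the command as a list means no shell gets involved:
code:
from sys import stdin, stdout
from subprocess import run

run(['cat'], stdin=stdin.buffer, stdout=stdout.buffer)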

Jakabite
Jul 31, 2010
So I'm trying to install pygame because the book I'm using has a project which uses it. I'm using Anaconda. I created a 2.7 environment because pygame apparently doesn't work in 2.8 yet. I used pip to install. Here's the console:

code:
(base) C:\Users\User>pip install pygame
Collecting pygame
Using cached pygame-2.0.0-cp38-cp38-win_amd64.whl (5.1 MB)
Installing collected packages: pygame
Successfully installed pygame-2.0.0

Yet Spyder still doesn't recognise it when I try to import pygame - just 'no such module exists'. I had previous issues but thought it was due to my messy installation on my external HDD, so I cleared room on my main drive and did it all to the letter of the book - still no good. Any tips?

NinpoEspiritoSanto
Oct 22, 2013




Have you added that base venv interpreter to spyder?

Jakabite
Jul 31, 2010

Bundy posted:

Have you added that base venv interpreter to spyder?

I have no idea what this is, but I’ll get on it if that seems necessary?

NinpoEspiritoSanto
Oct 22, 2013




Jakabite posted:

I have no idea what this is, but I’ll get on it if that seems necessary?

Phone posting but in your paste earlier you had (base) in front of your prompt for pip, suggesting an activated venv.

If spyder isn't aware of that venv it won't find the module you installed in it. One reason I switched to pycharm was less faff managing venvs.

Jakabite
Jul 31, 2010
Ah yeah I had just activated one for the previous version of python. Overall for what I need Anaconda is seeming like more hassle than it’s worth. Might be time to switch.

namlosh
Feb 11, 2014

I name this haircut "The Sad Rhino".

salisbury shake posted:

I have a Python script that I'd like to redirect stdin to stdout in, like you can do in the shell. This is how I'm using the script:
code:
$ echo "hi" | python3 my_script.py
hi
$ pv /dev/zero | python3 my_script.py > /dev/null
There's an issue with manually iterating over stdin and writing it to stdout, because Python for-loops are slow:
code:
from sys import stdin, stdout

for line in stdin.buffer:
  stdout.buffer.write(line)
The above is not only slow, it maxes out my CPU at only something like 150MB/s.

Going further by removing the explicit for-loop and using iterators yields roughly the same speeds and CPU utilization:
code:
from sys import stdin, stdout

stdout.buffer.writelines(stdin.buffer)
However, when redirecting standard input using a subprocess, I can hit 3+ GB/s with much less CPU usage:
code:
from sys import stdin, stdout
from subprocess import run

run(
  'cat',
  shell=True,
  stdin=stdin.buffer,
  stdout=stdout.buffer
)
The problem is that the above isn't cross-platform, and it bothers me that I need to shell out to a subprocess to redirect standard input efficiently.

Given that I just want to redirect stdin to stdout, I figure I need to swap their file descriptors using something like os.dup2() instead of using Python to explicitly iterate. Anyone know how I'd go about that?


I’m not super advanced with Linux or python. I know my way around both though... and I’m curious, why would you want to do this
A) in python
B) at all

Reading stdin and pushing to stdout that is. Isn’t that like, typing on a screen?

Excuse me if this is a really dumb question... I mean no offense, I simply don’t know

DearSirXNORMadam
Aug 1, 2009
For multiprocessing pools, should I be spawning one any time I want to async-map over an iterator, or can I just create a single pool for the arena that I am working with and then summon it by reference any time I need to go over an iterator? I'm not planning to actually asynchronously run anything, I just want to parallelize some iterative computations.

Also the documentation makes OMINOUS ALLUSIONS to the fact that you can't rely on garbage collection for pools and you have to close them manually. So if I store a pool, do I like... have to actually write a destructor for whatever object encloses it telling it to terminate the pool?

OnceIWasAnOstrich
Jul 22, 2006

Mirconium posted:

For multiprocessing pools, should I be spawning one any time I want to async-map over an iterator, or can I just create a single pool for the arena that I am working with and then summon it by reference any time I need to go over an iterator? I'm not planning to actually asynchronously run anything, I just want to parallelize some iterative computations.

Also the documentation makes OMINOUS ALLUSIONS to the fact that you can't rely on garbage collection for pools and you have to close them manually. So if I store a pool, do I like... have to actually write a destructor for whatever object encloses it telling it to terminate the pool?

You have to close the pool one way or another, it won't happen if it is an attribute of an object that gets garbage collected. The easiest way is to open pools with context managers and the with keyword.

This means you can re-use a pool like you mention to save spinning up new processes, but you need to keep track of it and ensure you close it eventually. If your parallel jobs take long enough, just re-make the pool each time; but if spinning up a new pool takes enough time, it is worth keeping it around, especially if you use a lengthy pool initializer function. You could create the pool in an outside context with a context manager and have the entire lifecycle of your object happen inside of that, I suppose.
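The context-manager version looks roughly like this (a sketch with a made-up worker function):
code:
from multiprocessing import Pool

def compute(x):
    return x * x

if __name__ == '__main__':
    # the with-block terminates the pool for you when it exits, even on an exception
    with Pool(processes=4) as pool:
        results = pool.map(compute, range(100))
    print(results[:5])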

salisbury shake
Dec 27, 2011

namlosh posted:

I’m not super advanced with Linux or python. I know my way around both though... and I’m curious, why would you want to do this
A) in python
B) at all

Reading stdin and pushing to stdout that is. Isn’t that like, typing on a screen?

Excuse me if this is a really dumb question... I mean no offense, I simply don’t know

No offense taken. I built a tool like the pipe viewer (pv) tool that gives you insight into shell pipelines.

You use pv like this:
code:
$ export URL="https://releases.ubuntu.com/20.04.1/ubuntu-20.04.1-desktop-amd64.iso"
$ http "$URL" | pv > /dev/null
51.0MiB 0:00:06 [10.3MiB/s] [     <=>                                         ]
Where pv measures how fast and how much data was piped into stdin and then piped back out as it happens. My tool does something similar, and is intended to be used in a shell pipeline.

The faster I can redirect stdin to stdout, the better, which is why I figure I can swap file descriptors using os.dup2() because that's what Bash essentially does behind the scenes when you do I/O redirections.

salisbury shake fucked around with this message at 21:51 on Nov 10, 2020

DearSirXNORMadam
Aug 1, 2009

OnceIWasAnOstrich posted:

You have to close the pool one way or another, it won't happen if it is an attribute of an object that gets garbage collected. The easiest way is to open pools with context managers and the with keyword.

This means you can re-use a pool like you mention to save spinning up new processes, but you need to keep track of it and ensure you close it eventually. If your parallel jobs take long enough, just re-make the pool each time; but if spinning up a new pool takes enough time, it is worth keeping it around, especially if you use a lengthy pool initializer function. You could create the pool in an outside context with a context manager and have the entire lifecycle of your object happen inside of that, I suppose.

So what about the destructor strategy? Like if I just add pool.terminate() to __del__ are there potential issues with that? (I guess potentially if __del__ doesn't get called, which I assume can arise from crashes or Exceptions or something?)

bigperm
Jul 10, 2001
some obscure reference
Going through an openpyxl tutorial and have come across something I don't even know how to google.

I'm just opening an excel file and putting the rows into JSON. If the last print statement is indented, I get everything just like I want.

code:
import json
from openpyxl import load_workbook

workbook = load_workbook(filename="dataset.xlsx")

sheet = workbook.active

orders = {}


for row in sheet.iter_rows(min_row=2,max_row=15,values_only=True):
 
    order_id = id(row)

    order_set = {
        "Item Number": row[0],
        "Item Description": row[1],
        "QTY": row[2],
        "Requested Ship Date": row[3],
        "Item Class Code": row[4],
        "Customer Number": row[5],
        "Profile": row[6],
        "Species": row[7],
        "Color": row[8],
        "Customer PO Number": row[9]
    }
    
    orders[order_id] = order_set
    print(json.dumps(orders,indent=4))
However if that last print statement is not indented, I only get two rows worth of data printed out. Anyone who could point me in the direction of why would be greatly appreciated.

namlosh
Feb 11, 2014

I name this haircut "The Sad Rhino".

salisbury shake posted:

No offense taken. I built a tool like the pipe viewer (pv) tool that gives you insight into shell pipelines.

You use pv like this:
code:
$ export URL="https://releases.ubuntu.com/20.04.1/ubuntu-20.04.1-desktop-amd64.iso"
$ http "$URL" | pv > /dev/null
51.0MiB 0:00:06 [10.3MiB/s] [     <=>                                         ]
Where pv measures how fast and how much data was piped into stdin and then piped back out as it happens. My tool does something similar, and is intended to be used in a shell pipeline.

The faster I can redirect stdin to stdout, the better, which is why I figure I can swap file descriptors using os.dup2() because that's what Bash essentially does behind the scenes when you do I/O redirections.

Awesome! I get it now... makes total sense, thanks for taking the time to explain it to me.

^^^^^^^^^^^ the post above smells like a closure issue, maybe?

a foolish pianist
May 6, 2007

(bi)cyclic mutation

bigperm posted:

Going through an openpyxl tutorial and have come across something I don't even know how to google.

I'm just opening an excel file and putting the rows into JSON. If the last print statement is indented, I get everything just like I want.

code:
import json
from openpyxl import load_workbook

workbook = load_workbook(filename="dataset.xlsx")

sheet = workbook.active

orders = {}


for row in sheet.iter_rows(min_row=2,max_row=15,values_only=True):
 
    order_id = id(row)

    order_set = {
        "Item Number": row[0],
        "Item Description": row[1],
        "QTY": row[2],
        "Requested Ship Date": row[3],
        "Item Class Code": row[4],
        "Customer Number": row[5],
        "Profile": row[6],
        "Species": row[7],
        "Color": row[8],
        "Customer PO Number": row[9]
    }
    
    orders[order_id] = order_set
    print(json.dumps(orders,indent=4))
However if that last print statement is not indented, I only get two rows worth of data printed out. Anyone who could point me in the direction of why would be greatly appreciated.

I'm not entirely sure, but I'd guess that the rows are getting assigned the same id. Try indexing with an int and incrementing it for your dictionary key unless the id() function is important for some other reason.
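For example, roughly (a sketch of just the changed bits of your loop):
code:
# enumerate() hands out 0, 1, 2, ... so every row gets a distinct key
for order_id, row in enumerate(sheet.iter_rows(min_row=2, max_row=15, values_only=True)):
    order_set = {
        "Item Number": row[0],
        "Item Description": row[1],
        # ... remaining fields as before ...
    }
    orders[order_id] = order_set

print(json.dumps(orders, indent=4))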

bigperm
Jul 10, 2001
some obscure reference

a foolish pianist posted:

I'm not entirely sure, but I'd guess that the rows are getting assigned the same id. Try indexing with an int and incrementing it for your dictionary key unless the id() function is important for some other reason.

That was it. Thank you.

Foxfire_
Nov 8, 2010

salisbury shake posted:

The above is not only slow, it maxes out my CPU at only something like 150MB/s.

I would not be surprised if you are just running into python being slow as a limit. You are doing:
- Lookup what 'next' means on the stdin.buffer object
- Call it, it reads data into an internal buffer
- Allocate a new python object and copy the bytes into it (I think you are at least skipping the bytes -> UTF-8 conversions the way you are doing it)
- Lookup what 'write' means
- Call it, copy the data into another internal buffer (occasionally flush that buffer to the OS)
- Deallocate the python object
- Loop


The shell pipe redirections are just renaming things so that the output of the first and the input of the second are the same thing. You can't insert yourself in the middle of that, because there is no middle; they're one file with two names.

If you want to insert your program into a pipeline and don't care about anything besides counting, the fastest way will be to use os.open() / os.read() / os.write() and a decent sized buffer size to skip as much python as possible. You'll still be allocating&deallocating python objects, copying data, and doing dictionary lookups for every chunk though, so it'll be slower than a C one

Mirconium posted:

For multiprocessing pools, should I be spawning one any time I want to async-map over an iterator, or can I just create a single pool for the arena that I am working with and then summon it by reference any time I need to go over an iterator? I'm not planning to actually asynchronously run anything, I just want to parallelize some iterative computations.

Be aware that by default, Python multiprocessing on Unix does invalid things. It will fork() to create copies of the existing process, then try to use them as pool workers, which violates POSIX. It will usually happen to work with the most common libc implementations as long as absolutely everything in the process is single threaded. multiprocessing.set_start_method('spawn') will fix it.
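i.e. something like this near your entry point (a sketch):
code:
import multiprocessing

if __name__ == '__main__':
    # must be called once, before any pools or processes are created
    multiprocessing.set_start_method('spawn')
    with multiprocessing.Pool(4) as pool:
        print(pool.map(abs, range(-4, 4)))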

OnceIWasAnOstrich
Jul 22, 2006

Mirconium posted:

So what about the destructor strategy? Like if I just add pool.terminate() to __del__ are there potential issues with that? (I guess potentially if __del__ doesn't get called, which I assume can arise from crashes or Exceptions or something?)

Destructor would be fine if the object gets garbage collected properly. If you have a crash at the wrong time your pool will hang around afterward regardless of any of this (you will just be slightly more likely for this to happen if the pool is alive for the entire script lifetime instead of just during computation). I don't remember clearly what guarantees CPython makes about garbage collection and exceptions, but you would still probably want to wrap your function with a context manager and use __exit__ instead, or use a try/finally block.
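If you do want the pool's lifetime tied to your own object, the wrapper is roughly this shape (a sketch; the class and method names are made up):
code:
from multiprocessing import Pool

class Parallelizer:
    def __enter__(self):
        self._pool = Pool(4)
        return self

    def map(self, func, iterable):
        return self._pool.map(func, iterable)

    def __exit__(self, exc_type, exc, tb):
        # runs even if the with-block raised, unlike hoping __del__ fires
        self._pool.terminate()
        self._pool.join()

# usage:
#   with Parallelizer() as p:
#       results = p.map(some_func, items)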

OnceIWasAnOstrich fucked around with this message at 15:31 on Nov 11, 2020

Qtamo
Oct 7, 2012
Pandas question: how can I split a column filled with strings into multiple rows by character count? I've got a dataframe that needs to get exported into an xls to be used as an import into an ancient system that limits cell character count to 100 characters. Since the strings are sentences, I'd prefer to split by the whitespace right before hitting 100 characters, but I haven't found a solution. It doesn't need to be efficient, the dataframe is only ~200 rows and probably somewhere around 1000 rows or less after it's been split.

Basic idea:
code:
Initial table
Name			String
Jimmy			300 characters
John			200 characters
Jane			100 characters
code:
Result
Name			String
Jimmy			100 characters
			100 characters
			100 characters
John			100 characters
			100 characters
Jane			100 characters

Zugzwang
Jan 2, 2005

You have a kind of sick desperation in your laugh.


Ramrod XTreme

Qtamo posted:

Pandas question: how can I split a column filled with strings into multiple rows by character count? I've got a dataframe that needs to get exported into an xls to be used as an import into an ancient system that limits cell character count to 100 characters. Since the strings are sentences, I'd prefer to split by the whitespace right before hitting 100 characters, but I haven't found a solution. It doesn't need to be efficient, the dataframe is only ~200 rows and probably somewhere around 1000 rows or less after it's been split.

Basic idea:
code:
Initial table
Name			String
Jimmy			300 characters
John			200 characters
Jane			100 characters
code:
Result
Name			String
Jimmy			100 characters
			100 characters
			100 characters
John			100 characters
			100 characters
Jane			100 characters
One simple way would be to use df.iterrows() and fill a list (let's call it row_list) with a dictionary for each row you want, with "Name" and "String" keys.

At the first entry for Jimmy, split the string into a list of however many 100-char strings, and append a dictionary with {"Name": "Jimmy", "String": [first 100-char string]} to row_list.

Then loop over the remaining strings, appending a dictionary to row_list with {"String": [next 100-char string]} until the strings are exhausted.

Finally, make a df with DataFrame(row_list, columns=['Name', 'String']).

Should work fine. Looping over dfs isn't efficient but as you said, it's only a few hundred lines so whatever.

Edit: like this.
code:
row_list = []
for _, row in df.iterrows():
    split_string = [row['String'][i: i+100] for i in range(0, len(row['String']), 100)]
    for i, chunk in enumerate(split_string):
        name = row['Name'] if i == 0 else ''
        row_list.append({'Name': name, 'String': chunk})
df2 = DataFrame(row_list, columns=['Name', 'String'])

Zugzwang fucked around with this message at 16:59 on Nov 11, 2020

Head Bee Guy
Jun 12, 2011

Retarded for Busting
Grimey Drawer
I’m trying to learn Python for data journalism, and I was wondering if anyone had any resource suggestions. I’m looking for a basic curriculum that I can follow, test my working knowledge, and track my progress.

I’ve been dabbling in python4everybody, but I was wondering if there were other suggestions.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Head Bee Guy posted:

I’m trying to learn Python for data journalism, and I was wondering if anyone had any resource suggestions. I’m looking for a basic curriculum that I can follow, test my working knowledge, and track my progress.

I’ve been dabbling in python4everybody, but I was wondering if there were other suggestions.

I don't have a singular resource, but I have some tips:
It seems like your goal would be to make great visualizations of data. I'd pick a project I want to write about and can get data for, and I'd start googling how to do what I want.

As a hard-learned tip, even 3 years in, I STILL find matplotlib cumbersome. Many Python plotting libraries are wrappers on it. Here are some of the options you might want to get familiar with.

If your goal is to make web apps for journalism, the easiest way if you're just starting out and only know Python is Plotly Dash, IMO. Bonus that it is built by the same people who make plotly, so it does that stuff very easily.

DearSirXNORMadam
Aug 1, 2009

CarForumPoster posted:

I don't have a singular resource, but I have some tips:
It seems like your goal would be to make great visualizations of data. I'd pick a project I want to write about and can get data for, and I'd start googling how to do what I want.

As a hard-learned tip, even 3 years in, I STILL find matplotlib cumbersome. Many Python plotting libraries are wrappers on it. Here are some of the options you might want to get familiar with.

If your goal is to make web apps for journalism, the easiest way if you're just starting out and only know Python is Plotly Dash, IMO. Bonus that it is built by the same people who make plotly, so it does that stuff very easily.

People with more experience can chime in, but as someone who self-taught a lot of programming: yeah, pick a project first. It's SO much easier to motivate yourself to learn if you have a limited set of objectives that you want to achieve for reasons beyond "I want to know a thing". Trying to digest the entire Python syntax and standard library is going to be largely beside the point for you, because a lot of it is built to do programmer things instead of data-processing things.

Also python visualizations are all awful, especially for making actual nice looking plots that do unusual formatting, as presumably would be needed in data journalism, DOUBLE especially for making them HTML-friendly. I personally have had good luck with auto-generating javascript and html5. It's an added layer of learning curve, but when you get good at python, remember that as an option.

OnceIWasAnOstrich
Jul 22, 2006

Mirconium posted:

Also python visualizations are all awful, especially for making actual nice looking plots that do unusual formatting, as presumably would be needed in data journalism, DOUBLE especially for making them HTML-friendly. I personally have had good luck with auto-generating javascript and html5. It's an added layer of learning curve, but when you get good at python, remember that as an option.

I can't really agree with this, although default matplotlib and some of its wrappers can be awful especially if you want non-raster renderers. There are plenty of nice HTML-friendly ways to do very nice visualizations in Python including but not limited to Plotly and Bokeh. Rolling your own Javascript and HTML generation seems like an amazing amount of work for something that is probably going to be uglier and way harder to use than the Python plotly.js interface.

Dash/Plotly is a great resource for data journalism-type stuff where you want fancy/nice/interactable/web-friendly visualizations.

salisbury shake
Dec 27, 2011

Foxfire_ posted:

If you want to insert your program into a pipeline and don't care about anything besides counting, the fastest way will be to use os.open() / os.read() / os.write() and a decent sized buffer size to skip as much python as possible. You'll still be allocating&deallocating python objects, copying data, and doing dictionary lookups for every chunk though, so it'll be slower than a C one

Thanks, I tried this out and hit 3GB/s with a 64KB buffer.

code:
from os import read, write

KB = 2 ** 10
CHUNK = 64 * KB

while data := read(0, CHUNK):
  write(1, data)
And I hit 3GB/s using the stdin/stdout handles, too.
code:
from sys import stdin, stdout

while data := stdin.buffer.read(CHUNK):
  stdout.buffer.write(data)

Qtamo
Oct 7, 2012

Zugzwang posted:

One simple way would be to use df.iterrows() and fill a list (let's call it row_list) with a dictionary for each row you want, with "Name" and "String" keys.

At the first entry for Jimmy, split the string into a list of however many 100-char strings, and append a dictionary with {"Name": "Jimmy", "String": [first 100-char string]} to row_list.

Then loop over the remaining strings, appending a dictionary to row_list with {"String": [next 100-char string]} until the strings are exhausted.

Finally, make a df with DataFrame(row_list, columns=['Name', 'String']).

Should work fine. Looping over dfs isn't efficient but as you said, it's only a few hundred lines so whatever.

Edit: like this.
code:
row_list = []
for _, row in df.iterrows():
    split_string = [row['String'][i: i+100] for i in range(0, len(row['String']), 100)]
    for i, chunk in enumerate(split_string):
        name = row['Name'] if i == 0 else ''
        row_list.append({'Name': name, 'String': chunk})
df2 = DataFrame(row_list, columns=['Name', 'String'])

Thanks for this. I'd read the warning in the pandas docs about not modifying something I'm iterating over and for some reason it didn't occur to me to just throw the stuff into a new dataframe, so I avoided iterrows altogether :doh:

Zugzwang
Jan 2, 2005

You have a kind of sick desperation in your laugh.


Ramrod XTreme

Qtamo posted:

Thanks for this. I'd read the warning in the pandas docs about not modifying something I'm iterating over and for some reason it didn't occur to me to just throw the stuff into a new dataframe, so I avoided iterrows altogether :doh:
Glad to help! You could also just reassign df to the new DataFrame, i.e. use df = DataFrame(args) at the end. That doesn't count as modifying something you're iterating over -- an actual example of that would be deleting a dictionary key while iterating through dict.items() or something.

a foolish pianist
May 6, 2007

(bi)cyclic mutation

Qtamo posted:

Pandas question: how can I split a column filled with strings into multiple rows by character count? I've got a dataframe that needs to get exported into an xls to be used as an import into an ancient system that limits cell character count to 100 characters. Since the strings are sentences, I'd prefer to split by the whitespace right before hitting 100 characters, but I haven't found a solution. It doesn't need to be efficient, the dataframe is only ~200 rows and probably somewhere around 1000 rows or less after it's been split.

Basic idea:
code:
Initial table
Name			String
Jimmy			300 characters
John			200 characters
Jane			100 characters
code:
Result
Name			String
Jimmy			100 characters
			100 characters
			100 characters
John			100 characters
			100 characters
Jane			100 characters

Stupid simple version:

code:
>>> import pandas as pd
>>> data = pd.DataFrame()
>>> data['name'] = ['x','y','z']
>>> data["text"] = [["a", "b", "c"], ["a", "b"], ["a"]]
>>> data
  name       text
0    x  [a, b, c]
1    y     [a, b]
2    z        [a]
>>> data.explode("text")
  name text
0    x    a
0    x    b
0    x    c
1    y    a
1    y    b
2    z    a
>>>
You need to split your text column into a list of strings with the appropriate length before you do the .explode(), of course.
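e.g. something along these lines (a sketch that chops at exactly 100 characters; splitting on whitespace needs the fancier loop from earlier in the thread):
code:
def chunks(s, n=100):
    # naive fixed-width split, no regard for word boundaries
    return [s[i:i + n] for i in range(0, len(s), n)]

data["text"] = data["text"].apply(chunks)
data = data.explode("text")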

Zoracle Zed
Jul 10, 2001
I'd recommend the grouper iterator but god drat it's annoying itertools has a "recipes" section in the documentation instead of just putting code in the library where it'd be useful
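For reference, the recipe in question is just this (more or less verbatim from the itertools docs):
code:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)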

salisbury shake
Dec 27, 2011
more-itertools makes the grouper recipe available.

OnceIWasAnOstrich
Jul 22, 2006

Zoracle Zed posted:

I'd recommend the grouper iterator but god drat it's annoying itertools has a "recipes" section in the documentation instead of just putting code in the library where it'd be useful

The number of times I have had to Google and copy-paste the exact same function off of either Stack Overflow or the itertools docs, depending on which one shows up first, is just so frustrating. Put it in the drat library already. I don't need the best way to write that memorized, taking up space in my brain. Who do we need to bother to make this happen?

I know I can install more-itertools or whatever but I don't want a whole extra dependency when that is an incredibly common and simple need that would fit well in itertools.

Qtamo
Oct 7, 2012
Thanks for all the suggestions - I ended up making my own solution before reading up on the replies since my last post, so here's a horrible version building on Zugzwang's initial reply and some applied stackoverflow (warning, this might be really ugly/stupid):
code:
finalrows = []
for _, row in df.iterrows():
    initialsplit = iter(row['Seliteraw'].split())
    lines, current = [], next(initialsplit)
    for word in initialsplit:
        if len(current) + 1 + len(word) > 100:
            lines.append(current)
            current = word
        else:
            current += ' ' + word
    lines.append(current)
    for i, chunk in enumerate(lines):
        id = row['id']
        id2 = row['id2']
        finalrows.append({'ID': id, 'ID2': id2, 'Text': chunk})          
df2 = pd.DataFrame(finalrows, columns=['ID', 'ID2', 'Text'])

Afterwards df2 gets merged into the original df since it holds other stuff I need - there's probably a way of doing this neatly in place, but this seems to work well enough. Feedback's still welcome, since I'm probably doing something in a really stupid way, being a Python beginner and all.

Loezi
Dec 18, 2012

Never buy the cheap stuff
I'm reading up on coroutines but getting tripped up by what the current preferred method of doing things is.

Various older tutorials are full of examples like
Python code:
def bare_bones():
    print("My first Coroutine!")
    while True:
        value = (yield)
        print(value)

coroutine = bare_bones()
next(coroutine)
coroutine.send("First Value")
But more recent PEPs are full of await and __aiter__ and async def and all this stuff, and the asyncio module says "Support for generator-based coroutines is deprecated and is scheduled for removal in Python 3.10.", and I'm not clear on whether that applies to just that module or to the whole concept. It's a whole big mess.

Does anyone have a good, up to date tutorial on this stuff?
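For comparison, my rough understanding of the native-coroutine version of that example is something like this (a sketch, not from any particular tutorial; it isn't a drop-in replacement for the .send() pattern):
Python code:
import asyncio

async def bare_bones(queue):
    print("My first native coroutine!")
    while True:
        value = await queue.get()
        print(value)

async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(bare_bones(queue))
    await queue.put("First Value")
    await asyncio.sleep(0.1)  # give the consumer a chance to run
    task.cancel()

asyncio.run(main())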


Jakabite
Jul 31, 2010
So I've been following a tutorial for making a basic space invaders type game with pygame. I decided to also start doing my own thing in parallel to exercise the old creative muscles too - it's all going fine so far except for 'blitting' the ship. Everything, as far as I can tell, is the same as the other project which does work. I can post the whole code if that would help but for now I'll stick to the line giving me trouble:

self.screen.blit(self.image, self.rect)

This is the only line in the blitme() method of the Ship class. I've tried setting a breakpoint when the method is called, and it seems to call fine from the main code. self.screen's properties correspond to the surface that the whole game runs on, self.image has the correct properties that it should have after loading the image I'm using for the ship, and self.rect also has the correct properties of the rect of the ship image. I even tried changing the x and y attributes of the rect to (100, 100) to absolutely ensure it was within the borders of the screen. When I run the program, however, it doesn't blit the ship. It doesn't throw any errors or anything, the ship just never appears. I've tried changing the image used for the ship to something much bigger and more obvious too. Nothing. Any ideas?

E: self.screen.fill is also not doing anything. It's just showing a black screen no matter what values I pass. When I print self.screen it does show up as a surface with the expected dimensions though. It's like anything to do with self.screen just... isn't happening?

E2: Forgot to add 'pygame.display.flip()' in my update_screen method. Leaving this here as a lesson to similar fools.
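For the record, the fix is just making sure the frame actually gets pushed to the display at the end of the loop, roughly like this (a sketch; settings/bg_color are whatever your own project calls them):
code:
# end of the main loop, after handling events and updating positions
self.screen.fill(self.settings.bg_color)
self.ship.blitme()
# nothing drawn this frame is visible until the back buffer is flipped
pygame.display.flip()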

Jakabite fucked around with this message at 22:17 on Nov 16, 2020
