CarForumPoster
Jun 26, 2013

⚡POWER⚡
Question: What is the proper way to break out these email addresses into separate rows?

I get a response from an API when I lookup an email address for someone formatted like this:

code:
[Email(valid_since=datetime.datetime(2013, 5, 14, 0, 0), last_seen=datetime.datetime(2018, 8, 23, 0, 0), type_='personal', email_provider=True, address='greatemailname@gmail.com', address_md5='11137bc04acd3df7974979429e9ed15c'), Email(valid_since=datetime.datetime(2008, 4, 9, 0, 0), last_seen=datetime.datetime(2017, 12, 1, 0, 0), type_='personal', email_provider=True, address='greatemailname6@yahoo.com', address_md5='1114e2178a3ddaa46405811db44b58db')]
I stick this into its own cell in a dataframe which has the format: UserName | APIRespEmailString

I will eventually send this person an email at both emails and will need to break this into one line, one email like:

UserName1 | greatemailname@gmail.com
UserName1 | greatemailname6@yahoo.com
UserName2 | Email1
UserName2 | Email2
UserName2 | Email3
Username3 | Email1
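(For reference, a minimal sketch of one way to do this, not necessarily the thread's answer: pull the addresses out of the stored API-response string with a regex, then build a one-row-per-email frame. The column names match the format above; the sample data is made up.)

Python code:
import re
import pandas as pd

# Hypothetical input frame: UserName | APIRespEmailString
df = pd.DataFrame({
    'UserName': ['UserName1'],
    'APIRespEmailString': [
        "[Email(..., address='greatemailname@gmail.com', ...), "
        "Email(..., address='greatemailname6@yahoo.com', ...)]"
    ],
})

# Grab every address='...' value out of the raw response string
# (this skips the address_md5='...' fields because of the quote placement).
def extract_emails(resp):
    return re.findall(r"address='([^']+)'", str(resp))

# One output row per (UserName, email) pair.
rows = [
    {'UserName': user, 'Email': email}
    for user, resp in zip(df['UserName'], df['APIRespEmailString'])
    for email in extract_emails(resp)
]
emails_df = pd.DataFrame(rows)
print(emails_df)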


CarForumPoster
Jun 26, 2013

⚡POWER⚡

Bundy posted:

Can we not goon up what had until now been a nice, interesting and informative thread, with a stupid slapfight thanks just thanks.

Person who posted the screenshot stuff, I found it very interesting and always like to find more performant ways of doing things thank you.

This. I really enjoyed the MSS chat; I've used both OpenCV with Python and pyautogui but hadn't heard of MSS.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

cinci zoo sniper posted:

PyCharm 2019.1 is out. Summary - revamped Jupyter integration; improvements for data classes, debugging, type checking, pytest.

The Jupyter changes look good. Jupyter notebooks being better in Firefox really decreased my use of PyCharm, even though I really liked PyCharm for .py files.

https://www.youtube.com/watch?v=TIZH4aPSN2E

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Boris Galerkin posted:

Speaking of data, I'm looking to store several GB of CSV data in a single compressed HDF5 file and I was wondering what package I should use for that. Right now I've found h5py, pytables, and there's also xarray I guess. Are there any pros/cons for any of these?

Why not just use pandas?
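For what it's worth, a minimal sketch of the pandas route (it uses PyTables under the hood, so that still has to be installed; file and key names here are made up):

Python code:
import pandas as pd

# Read one CSV and append it, compressed, into a single HDF5 store.
df = pd.read_csv('big_file.csv')
df.to_hdf('data.h5', key='big_file', mode='a', complevel=9, complib='blosc')

# Read it back later.
df_again = pd.read_hdf('data.h5', key='big_file')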

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Boris Galerkin posted:

A few weeks ago Mozilla introduced Iodide which is some kind of notebook-like clone but for Javascript and I think one of the selling points is that it runs entirely in your browser, no external server needed.

Yesterday they introduced Pyodide which as you probably guessed in Iodide but with Python, and it also runs entirely in your browser.


Here's an example notebook (and here's another one).


It seems neat at first glance but I can't really imagine a scenario where the running in my browser thing would be something I'd use. If I'm on a laptop then spinning up a jupyter notebook server takes no effort, so for me the useful thing would be if this could be run on like an iPad.

Speaking of iPads, I've been using jupyter notebooks with the "Juno" app and running them on Azure Notebooks and it's pretty great.

Why code on an ipad?

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Umbreon posted:

Oh I agree, I'm only asking about the monthly sub, I wouldn't get anything longer than that. I'm saying that even $40 a month feels really expensive for something like this, I just want to make sure it's worth it.

(as things stand, it feels like it's worth it, but I've only done the first few lessons so I want to make sure it doesn't lower in quality later on or anything like that)

If you're using it to do personal projects, maybe it's a little costly...I'd probably just move on to projects from there. (e.g. if you want to learn python to do data science/deep learning, go here, it is free: https://course.fast.ai/videos/?lesson=1)

If you're trying to do your current job better, get your company to pay for it.
If you're trying to get a new job, it is dirt cheap relative to a code bootcamp or other more formal training scenario.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

KICK BAMA KICK posted:

A simple dict mapping strings to strings (can guarantee they're short, alphanumeric, no whitespace) that I want to read/write from disk -- is pickle the best option or would you rather write it as a text file? Few hundred entries at most, accessed/modified a only few times a day. Portability between implementations and platforms would be a huge plus.

I do a lot of pickle reading/writing and CSV reading/writing with the pandas implementation of each. Pickle is an order of magnitude faster; I'd take pickle every time. (Bonus: it handles mixed data types.)
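A quick sketch of what that looks like in pandas (paths are placeholders):

Python code:
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b'], 'value': ['1', '2']})

# Pickle round trip: fast, keeps dtypes, but Python-specific.
df.to_pickle('lookup.pkl')
df = pd.read_pickle('lookup.pkl')

# CSV round trip: slower, but portable and human-readable.
df.to_csv('lookup.csv', index=False)
df = pd.read_csv('lookup.csv')
For the portability concern in the question, a plain json.dump of the dict is also worth a look: it round-trips string-to-string mappings exactly and any language can read it.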

CarForumPoster
Jun 26, 2013

⚡POWER⚡

punished milkman posted:

Anyone have any package suggestions for extracting tables of data from image files (.png/.jpg) ? I tried using Tesseract/pytesseract and while it's doing a great job of detecting the text, the tabular aspect of it is totally lost and I couldn't find a straight forward path to processing tables with it. I've used Camelot with PDFs before, and it worked OK (at best), but I'm hoping to use something else this time around.

Extortionist posted:

This isn't an easy problem. If the images are fairly consistent you can try using one of the tesseract outputs that supplies word coordinates and do your own table determination based on the relative positions of words. It might also be useful to run the images through opencv first to extract the positions of the lines (possibly also removing them from the image, or splitting into several small images prior to OCR).

You might look at AWS Textract (still in preview) or the Google/Azure OCR services, too, if someone's paying for it.

Chiming in to say Docucharm is pretty good at this if they're in a vaguely consistent format. For example, I needed to turn typed and printed-then-scanned reports into structured data. I compared it to Textract (which I got access to early) and Docucharm was much better. I haven't used either in about 4 months, so I can't say whether either has made large progress.

CarForumPoster fucked around with this message at 22:11 on May 18, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Hughmoris posted:

I need some advice on how to approach a problem at work.

We have a rule engine that processes incoming orders. A new order comes in with a list of properties, and depending on what those properties and values are, the rule engine directs the order down a specific node to an action. Behind the GUI editor, the rule engine is essentially an XML file.

The problem is analysts can tweak the rules in order to account for a new type of order that will be processed, or to edit the current routing of a particular order. We don't have a good way of testing after an analyst makes a configuration change. It's easy to test their specific change (by submitting that specific order and seeing if it behaves as intended), but it could have unforeseen consequences for the routing of other order types.

I'd like to create something that would allow for automated testing. I'm not sure how to start, given the XML file that represents the current conditional rules. Maybe recreate the entire XML file in Python as conditional statements? Create test orders that are essentially a JSON object {'Order123': {'Type:Sports', 'Method:Online', 'Cost:$530'}}. Then run that order through my test script before an analyst makes changes to the engine. Then run the test script again post configuration change, and verify it's the same result.

Any thoughts/advice on how I could try tackling this?

Why not just have a set of orders that get run through each time a configuration change is made?

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Hughmoris posted:

That was my initial thought, too. I could have a defined order set, place the orders, then check what the final destination is for each order pre and post rule change.

Unfortunately, there isn't a clean way to automate data pulls and extractions after orders are run through the system. We'd have to use the system's search engine to run a manual search, then export the results to CSV files, then compare said CSV files. Going that route will definitely be better than what we have (nothing).

Would be nice to automate everything but I'm stuck on what recreating that XML in Python would look like.

If you can really reliably find them by searching you could use python/selenium to get the data out the other side automatically.
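If the CSV-export route sticks, the compare step itself is easy in pandas. A sketch, assuming the exports share an order ID and a final-destination column (column names made up):

Python code:
import pandas as pd

before = pd.read_csv('results_before_change.csv')
after = pd.read_csv('results_after_change.csv')

# Line the two runs up by order ID and flag any order whose routing changed.
merged = before.merge(after, on='OrderId', suffixes=('_before', '_after'))
changed = merged[merged['Destination_before'] != merged['Destination_after']]
print(changed[['OrderId', 'Destination_before', 'Destination_after']])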

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Going to start my first Django project this week. Given that I like PyCharm, is it worth getting the Professional version over Community for the web-development-related features? If this project goes well, I expect to be using it daily for several weeks at least.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

unpacked robinhood posted:

Python code:
import pendulum as pdu

d1 = pdu.parse('2019-01-01')
d2 = pdu.parse('2019-05-31')
inside = pdu.parse('2019-04-20')
outside = pdu.parse('2008-04-20')

print(d1<inside<d2) # True
print(d1<outside<d2) # False
Seems ok ?

I had never heard of pendulum until now, and I have definitely been bitten by datetime issues or written overly complicated code for that kind of bs. Absolutely fantastic.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Empress Brosephine posted:

So I finished Python Crash Course and loved it; what should I read next to improve my skills? Is it worth learning more than the blade level of skills with Flask?

Thanks all.

Find a project you want to do to solve some problem and do it.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

General_Failure posted:

I may have been hit by a pip typesquatting attack.

I only just noticed it. There was a file in my home called "typesquating-attack" with that spelling. The contents said something like "You have been hacked since" followed by the date and time. April 4th. I just shut it down after seeing that. I have a suspicion it may have been from a package I accidentally installed and removed shortly after called "tensoflow". Any tips on this?

e: looks like the package was removed from pypi

Thanks for posting about this, I didn't know this was a thing.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

FCKGW posted:

I learned python in community college a few years back and I'm going to be going back to school soon and would like to get a refresher course.

Anyone have any online video series they could recommend? Something from Coursersa or Edx or Codeacademy, etc..?

I like codecademy but even more than that I like picking something you want to make and making it. You'll pick it up quickly.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
TLDR: Is pywin32 the only/easiest way to print PDFs with Python on Windows?

——

I want to print a few hundred PDFs per week, in a specific order, from a specific copier, with alternating printer properties, on Windows. Some are hole punched with paper from tray 1, some are B&W only from tray 2. Batch printing greatly slows down the manufacturing process that follows printing (assembling documents and mailers).

Pywin32 looks complicated as gently caress for what should be a trivial problem. The “shell” method looks like it wouldn’t take enough arguments, leaving me reading MSDN docs and going very low level. Has there really been no progress since 2010 or so?!

Edit: I only have about 4 configs and 1 printer on 1 port that I’m concerned with, so I am going to try installing the same printer 4 additional times with unique names and default settings. Not an elegant pythonic solution, but it should work.

CarForumPoster fucked around with this message at 22:36 on Jul 6, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

QuarkJets posted:

Have you looked at pkipplib or win32print?


Thermopyle posted:

If you're asking about printing PDFs that already exist rather than actually creating PDF files with python, I'd just look into generic windows ways to send a file to the printer. Hell, a quick google shows some results for "windows print pdf from command line", which you could just do from python.

Yea, the PDFs are already created with Python (a report, an envelope, and a shipping label). win32print is the pywin32 method that gets hilariously complicated quickly. Sending a PDF to a printer that’s already configured is pretty easy; the problem is configuring the printer to do what I want (e.g. hole punch) from Python. I think the solution is “install 4 copies of the printer driver with different default configs” but I can’t test until Monday.

CarForumPoster fucked around with this message at 01:05 on Jul 7, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Appreciate you guys' help.

Thermopyle posted:

If you're asking about printing PDFs that already exist rather than actually creating PDF files with python, I'd just look into generic windows ways to send a file to the printer. Hell, a quick google shows some results for "windows print pdf from command line", which you could just do from python.

CarForumPoster posted:

I am going to try installing the same printer 4 additional times with unique names and settings. Not an elegant pythonic solution but should work.

This worked, is dead simple, and is pretty reliable. My printer or print spooler doesn't necessarily honor the order sent, but a little bit of waiting fixed that.
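A minimal sketch of what driving that looks like from Python, assuming pywin32 is installed (the printer names are made up, and the print verb relies on whatever PDF viewer Windows has registered):

Python code:
import os
import time
import win32print  # part of pywin32

# The same physical copier installed several times with different default settings.
PRINTER_FOR_JOB = {
    'report': 'Copier - Hole Punch - Tray 1',
    'envelope': 'Copier - BW - Tray 2',
}

def print_pdf(path, job_type):
    # Make the pre-configured copy the default, then let Windows print the file.
    win32print.SetDefaultPrinter(PRINTER_FOR_JOB[job_type])
    os.startfile(path, 'print')
    time.sleep(5)  # crude pacing so the spooler keeps jobs in order

print_pdf(r'C:\jobs\report_0001.pdf', 'report')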

CarForumPoster fucked around with this message at 12:39 on Jul 9, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡
What I want to do: Convert a .docx file containing comments to PDF, displaying the Word comments in the PDF.

Why: I have a Flask web application on Heroku where a user uploads a .docx, which is sent to S3; the app adds comments to the .docx, then returns the PDF, which is displayed in the browser.

The problem: Heroku runs on Linux, and LibreOffice doesn't render the comments in the PDF the pretty way Word on Win10 does; it instead embeds them as PDF comments.
In Win10 Word O365 this is trivial: you just save it as a PDF, which you can do easily from Python.

I'm looking for the simplest/best solution to output a PDF with comments that looks like it does in Word.

A few things I'm considering:
-Have an Amazon Workspaces Win10 w/O365 instance running basically as a server, plus a Python script that somehow gets the files from S3, converts them on Windows, returns the PDF to S3, and notifies my web app that the file is available for download. Not sure how to do this, but it seems possible.
-Trying a bunch of other Linux DOCX->PDF solutions to see if they're any better.
-Try installing MS Word on Heroku. Not sure if possible. (Update: It's not.)

EDIT:
Some better ideas, possibly:
-Use MS Office Online to convert the Word doc
-Find an API to do it for me.


EDIT2: Finding an API seems like the best solution, but gently caress me if they don't suppress the comments output. Tried 3 so far. One works, but the docs are confusing. Two don't work.

EDIT3:
Solved it using Zamzar. Just so happens that's how their DOCX->PDF conversion works, and their docs are great.

CarForumPoster fucked around with this message at 20:00 on Jul 9, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Thermopyle posted:

What you're looking for is called a task queue.

The canonical solution is called Celery. However, it's very configuration-heavy because it's Enterprise Grade.

95% of the time I prefer python-rq. It's simple with fine docs.

Behind the scenes the basic idea works like this:

You have a server like Redis running. Your webserver python code puts a message into Redis. Your task queue python code sees that message and runs the tasks you've configured to run when such a message appears.

Me and another dev just made babby's first deployed web app, and this is exactly what we did for a function that takes about 3 minutes to run. As a pro tip on the RQ/Redis/Flask/Dash combo: it doesn't play nice on Windows 10. The worker.py file we used to grab poo poo out of the queue didn't work, so stuff just stacked up in Redis.

The front-end "web" worker times out after 30s on Heroku, so we also had to figure out how to not use a while loop to ask the queue if our jobs were done yet. Still kinda working on that last bit, but it was confusing for me for a while.
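For anyone following along, a minimal sketch of the Redis/python-rq pattern Thermopyle describes (the function is made up, and in real code it has to live in a module the worker can import; the worker itself is started separately with the rq worker command):

Python code:
import time

from redis import Redis
from rq import Queue

def build_report(user_id):
    # Stands in for the ~3 minute function mentioned above.
    time.sleep(180)
    return f'report for {user_id}'

# The web process just drops the job into Redis and returns immediately.
q = Queue(connection=Redis())
job = q.enqueue(build_report, 42)

# A separately running `rq worker` process picks the job up; the web side
# can poll its status instead of blocking in a while loop.
print(job.get_status())  # 'queued', then 'started', then 'finished'
if job.get_status() == 'finished':
    print(job.result)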

CarForumPoster fucked around with this message at 02:58 on Jul 19, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Thermopyle posted:

This reminds me of when I was first getting into web stuff...it was very confusing and nebulous and magical for a long time to me.

This is me

CarForumPoster
Jun 26, 2013

⚡POWER⚡
In case anyone else had this problem...I started to type

quote:

Is there a better way than a file I git-ignore to store secrets like API keys and whatnot? Like maybe an AWS service that can only be accessed by whitelisted IPs?

I'd like to outsource some web scraping tasks that are behind logins, but I don't want to give all of our API keys and passwords to every developer. I can certainly change them after, but it might be a few months before the projects are done.


But then I googled like a good boy, and there is one, and it works fine through boto3:
https://aws.amazon.com/blogs/aws/aws-secrets-manager-store-distribute-and-rotate-credentials-securely/
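Roughly what that looks like through boto3 (the secret name and region are placeholders, and this assumes the secret was stored as a JSON blob):

Python code:
import json
import boto3

client = boto3.client('secretsmanager', region_name='us-east-1')

# Fetch credentials at runtime instead of keeping them in a git-ignored file.
resp = client.get_secret_value(SecretId='scrapers/site-credentials')
creds = json.loads(resp['SecretString'])

print(creds['username'])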

CarForumPoster
Jun 26, 2013

⚡POWER⚡
If you have a helpful link as an answer, that'd be great, don't feel like you need to give me a detailed, personalized answer.

My problem: I'm starting to have more and more Python/Selenium based web scrapers that get very similar data from different gov websites.

For example: I scrape the same data using 6 templates from 6 different government websites and format all the data to go in the same Google Sheet. Right now, I can run all of these scrapers in about 10-15 minutes locally. We're considering upping this to 50 government websites/day, which goes a bit beyond what I can run sequentially on my local machine. Instead I'd like to run them simultaneously and automatically, simply making a log so I can catch any errors. One of them will run into errors about 1 in every 4 or 5 days, but with 5X the websites it'll likely happen daily for a while.

TLDR: What's the current industry standard to scrape 50+ websites per day using separate processes/workers? Is it time to figure out AWS Lambda? What's the industry standard way to log this?
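On the logging half of the question, the standard-library logging module plus a try/except around each scraper goes a long way. A sketch (the scraper registry is made up):

Python code:
import logging

logging.basicConfig(
    filename='scrapers.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s %(message)s',
)
log = logging.getLogger('gov_scrapers')

def run_all(scrapers):
    # scrapers: dict mapping a site name to a zero-argument scraper function
    for name, scraper in scrapers.items():
        try:
            log.info('starting %s', name)
            scraper()
            log.info('finished %s', name)
        except Exception:
            # Keep going; one flaky site shouldn't kill the other 49.
            log.exception('%s failed', name)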

CarForumPoster fucked around with this message at 21:08 on Jul 28, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Thermopyle posted:

I don't think there's really an industry-standard way. It kind of just depends on what your infrastructure is like.

If I was looking to make something robust, but on a budget, I'd run an instance of Redis and then use python-rq.

Easy to set up, and easily distributed across as many machines as you'd like.

python-rq's documentation is easy to follow.

I have what I feel like is a stupid question...should I just make a function for each template and then run it on AWS Lambda? Any real-world experience as to why I'd do that versus trying to figure out python-rq?

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Thermopyle posted:

I've done both and both are pretty easy.

I don't really think there's any more "figuring out" with either of them.

Lambda cost money of course. Doing it with redis may or may not depending on your current infrastructure.

It's not terribly uncommon to find web sites blocking the IP ranges of Amazon, Digital Ocean, Microsoft, etc, to prevent automated stuff like this.

Thanks for the encouragement. That's a good point regarding the IP ranges.

I have a hilarious amount of AWS credits, more than I can use before they expire. My current infrastructure is nothing: literally all our data is on S3 or SharePoint. There's one switch in our office for the Ethernet jacks in the walls.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

i vomit kittens posted:

I'm having some trouble working with datetime in a small app I'm creating. I'm able to convert a datetime object to a string using strftime, but when I try to convert the exact same string back into a datetime using strptime in order to search a database for it, I'm told that the formatting I'm using is not valid even though I literally copy/pasted the format from the strftime function.

The functions are:

new_ev.date_time.strftime("%-m/%d/%y %-I:%M %p")

datetime.strptime(date_time, "%-m/%d/%y %-I:%M %p")

The error I'm getting when the bottom one is called is "ValueError: '-' is a bad directive in format '%-m/%d/%y %-I:%M %p'"

Can anyone point out to me why I'm having this problem?

Poster above me hit it. I deal with a lot of date strings gotten from a variety of web scraping and API calls, and working with datetime is a bitch in that use case. I've come around to preferring pendulum. The best part is that much of the syntax is the same; it's just a little easier to use when dealing with date strings from the wild.
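For anyone hitting the same error: the - (no zero padding) flag is a platform-specific strftime extension, and strptime doesn't accept it at all, but the plain directives parse un-padded numbers fine anyway. A quick sketch of the round trip:

Python code:
from datetime import datetime

dt = datetime(2019, 8, 4, 16, 16)

# '%-m' works in strftime on Linux/macOS only ('%#m' on Windows), and
# strptime rejects the '-' flag outright -- hence the ValueError.
s = dt.strftime('%m/%d/%y %I:%M %p')   # '08/04/19 04:16 PM'

# The un-flagged directives also accept non-padded input, so one format
# string covers both directions:
back = datetime.strptime('8/4/19 4:16 PM', '%m/%d/%y %I:%M %p')
print(back)  # 2019-08-04 16:16:00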

Solumin posted:

Can I ask why you're converting a datetime object to a string and back?

My use case for this is that I get dates in a bunch of formats, and then I use those dates in one consistent format to do things like send emails via boto3/Amazon SES that have the date included.

CarForumPoster fucked around with this message at 16:16 on Aug 4, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

unpacked robinhood posted:

What's the correct way to keep track of a list of files on disk in a db, with irregular user chosen names ?
I'm pulling attachments from emails and turning them into rows in a table.
There's a string column that could hold a path. Ideally I wouldn't change the schema.

I have this use case, and the thing that worked for me is sending the file to S3 or SharePoint and just storing the URL and whatever content I want in that row.

EDIT: I should mention, we also replaced the filenames with UUIDs when storing, because we immediately ran into collisions with filenames.

CarForumPoster fucked around with this message at 02:04 on Aug 6, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

unpacked robinhood posted:

Thanks. Do you simply call uuidN() to generate a value and use it as filename ?

Python code:
import uuid

job_id = str(uuid.uuid4())
Then we append the extension to the job_id.
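Put together with the S3 part from the post above, the whole thing is only a few lines. A sketch (the bucket name and URL format are assumptions; adjust for your region/ACL setup):

Python code:
import os
import uuid
import boto3

def store_attachment(local_path, bucket='my-attachments-bucket'):
    # Collision-proof key: UUID plus the original file extension.
    ext = os.path.splitext(local_path)[1]
    key = f'{uuid.uuid4()}{ext}'

    boto3.client('s3').upload_file(local_path, bucket, key)

    # Store this URL (or just the key) in the existing string column.
    return f'https://{bucket}.s3.amazonaws.com/{key}'

print(store_attachment('invoice scan (final) v2.pdf'))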

CarForumPoster
Jun 26, 2013

⚡POWER⚡

KICK BAMA KICK posted:

From a while back, about task queues for a web app

I looked at this and then I looked at Huey, which I had remembered someone mentioning here, and at first glance I thought Huey would work better for me out the box -- can use SQLite instead of running Redis, didn't require extra packages for scheduled/repeatable tasks or Django integration, also its homepage doesn't stuff all text into a one-inch wide strip down the middle of my screen for some reason -- and that was completely wrong? python-rq with rq-scheduler and django-rq actually whips rear end and easily does exactly what I want, and the django-rq-scheduler package I don't actually need to use in my project (basically just adds to the Django admin interface some screens for controlling tasks, but they're pretty limited and not super useful in my case), but reading its code did give me a roadmap to storing the task information as a database entry to enable stopping/starting/changing the interval of the task on demand, which is the thing I wanted to do in the first place and could not figure out how to make work with Huey (I'm 100% sure it's doable but it's just not the API Huey presents and you'd have to hack it together).

Just making a mental note to take your word for it next time!

This is helpful. I'm starting down the path of a Django project now that will end up with workers calling APIs, and I'm pretty new to web app development.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

EVIL Gibson posted:

Mainly with GAN networks and running them in windows. If you install python normally, it needs to have a lot of environment settings set. Conda manages all of that and i can simply use one that works and clone it rather than hoping i changed all the settings correctly to use the python enviroment i freshly installed (problem with lots of applications both mac and pc where it leaves cruft behind )

venvs keep all that in one part and conda delivers and packages those enviroments quickly so i spend less time troubleshooting where everything goes wrong.

I don't do it professionally, but I recall that setting up TensorFlow (GPU version) and getting CUDA to work right on Win 10 at the time (~1 year ago) required a shocking amount of work compared to what I was expecting. I generally prefer conda installs, for exactly the reasons above, and pip install only when a conda version isn't available.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

i vomit kittens posted:

Does anyone have any recommendations for a good 2D drawing library? I'm using Pillow right now but it's kind of limited as far as shapes go. Aggdraw is pretty much exactly what I'm looking for, but it's Python 2 only.

I know almost nothing about this, but the free vector drawing software Inkscape seems to use Python for extensions. There may be an interface there that gets you rather robust drawing software from Python.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
I'd like your experiences using jupyter notebooks for daily or every 30 min cron jobs.

We have some web scrapers and API calls that run on an Amazon Workspace at a certain time every day.

We're about to refactor some important code related to this and are kinda half .py files and half .ipynb at this point. Generally the .ipynb files have the actual code a human would review and the .py files have scraper templates and 50+ helper functions in them. I really like the idea of using a ipynb for documentation.

What do you guys do? Anyone ditched .py files for ipynbs? We just started down this road, any thoughts on automatically scraping 10+ websites per day using jupyter notebooks?
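For reference, the two ways I know of to run a notebook unattended are jupyter nbconvert --to notebook --execute and papermill, which also lets the cron job pass parameters in. A sketch of the papermill flavor (paths and parameters made up):

Python code:
import papermill as pm

# Executes the notebook top to bottom and writes a dated copy with all the
# output cells filled in, which doubles as the run log / documentation.
pm.execute_notebook(
    'scrape_sites.ipynb',
    'runs/scrape_sites_2019-09-01.ipynb',
    parameters={'sites': ['siteA', 'siteB'], 'headless': True},
)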

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Sad Panda posted:

I'm writing a piece of software which will check the contents of a folder for files and display whether they are found or not. It will be used to check if students have submitted their work.

[...]

I'm trying to decide the best way to implement the interface. I'll have the normal version of Python, but be unable to pip install any extra packages. I'm thinking that I either...
1) Use Tkinter (no experience)
2) Re-generate the HTML each time and make it automatically reload that page. Would something like <meta http-equiv="refresh" content="10"> refresh to the regenerated HTML?

IMO if you're doing this for students and make it good, and you're just their dumb ol' teacher that built it, it'll probably get a few of them interested in coding.

I would accomplish this using Dash. Dash uses Python (Flask) to make pretty dashboards relatively easily. I'd have a function that does as you say continuously and plots the output.

There's lots of open source examples: https://dash-gallery.plotly.host/Portal/

And components, such as an upload button: https://dash.plot.ly/dash-core-components

...and even makes the HTML layout easy: https://dash.plot.ly/dash-html-components

If you like bootstrap, there's also this: https://dash-bootstrap-components.opensource.faculty.ai/l/components
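A bare-bones sketch of the kind of Dash app I mean: an Interval component re-checks a folder every 10 seconds and redraws the list (the folder path is a placeholder; these are the 2019-era package names, newer Dash exposes the same things as dash.html / dash.dcc):

Python code:
import os

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

app = dash.Dash(__name__)
app.layout = html.Div([
    html.H3('Submitted work'),
    html.Ul(id='file-list'),
    dcc.Interval(id='tick', interval=10 * 1000, n_intervals=0),  # fires every 10s
])

@app.callback(Output('file-list', 'children'), [Input('tick', 'n_intervals')])
def refresh(_):
    # Re-read the submissions folder on every tick and list what's there.
    return [html.Li(name) for name in sorted(os.listdir('submissions'))]

if __name__ == '__main__':
    app.run_server(debug=True)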

CarForumPoster fucked around with this message at 13:18 on Sep 8, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Sad Panda posted:

That looks super pretty, but seems to involve pip install and I still don't think I've got pip install privs.

I was on a super locked down comp at a large defense contractor and could still pip install things. I had to use the proxy param. Can you browse to pypi.python.org in your browser?

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Dominoes posted:

I don't get along well with Docker, and am looking for help on an open-source project. You might even be interested in using this later, if not now. What I'm specifically asking for help on is a proof-of-concept for getting official binaries hosted on python.org.

I’d like to suggest amazon workspaces free tier for windows. Takes about 4 seconds to set up.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Zerilan posted:

Humble Bundle has a python bundle again, https://www.humblebundle.com/level-up-your-python. I'm starting to apply again to some PhD programs for numerical analysis and machine learning, and Python commonly comes up as a desired language. I'm pretty used to doing math stuff in MATLAB and C, but less so in Python. Admittedly I'm a couple years out of my masters' now and had to settle for a job outside my field so pretty rusty on what I do know. How useful would the books/tools in this bundle be to self-teaching myself some python compared to whatever free online (or cheaper books) I could find out there?

IMO the best way to learn python is to find something you want to do and do it. There’s a stack overflow post or github pull request for literally everything.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
I can't think of a super easy way to do this and I bet there is one.

I have a pandas df:
code:
id|name|attrs
123|bob|[cool,old,man]
456|dave|[uncool,old,man]
I want:
code:
id|name|attrs
123|bob|cool
123|bob|old
123|bob|man
456|dave|uncool
456|dave|old
456|dave|man
What's the pythonic/pandas way to do this for a list (df['attrs'][index]) of arbitrary length?
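(For reference, newer pandas (0.25+) has DataFrame.explode, which does this in one call when the column holds actual Python lists:)

Python code:
import pandas as pd

df = pd.DataFrame({
    'id': [123, 456],
    'name': ['bob', 'dave'],
    'attrs': [['cool', 'old', 'man'], ['uncool', 'old', 'man']],
})

# One row per list element; id and name are repeated alongside.
long_df = df.explode('attrs')
print(long_df)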

CarForumPoster fucked around with this message at 14:19 on Sep 14, 2019

CarForumPoster
Jun 26, 2013

⚡POWER⚡

SurgicalOntologist posted:

There may be a shortcut or more clever way, but if you can assume that the list is the same length for all of them, you can use the .str accessor to access list items (I never understood why this is in the string accessor, but it's handy).

Python code:
n = len(df.attrs.iat[0])

# Unpack each list item to its own column.
for i in range(n):
    df[f'attr{i}'] = df.attrs.str[i]

# Convert wide to long.
new_df = pd.melt(
    df.drop('attrs', axis='columns'),
    id_vars=['id', 'name'],
    var_name='attr_ix',
    value_name='attr',
)
This will include a column 'attr_ix' with values like ['attr1', 'attr2', ...], so if you don't want that, just add .drop('attr_ix', axis='columns').

I'm 95% sure this is correct but there may be something weird that happens with the indices or something, melt sometimes takes several tries to get right.

Edit: After writing this I googled it and found some clever solutions based on df.attrs.apply(pd.Series) which I didn't realize would work. That only replaces the loop in my code, although I think it would work better with varying-size lists.

Melt looks like what I want; I never thought to describe it as unpivoting. Thank you!

CarForumPoster
Jun 26, 2013

⚡POWER⚡

This owns, exactly what I want, thanks!

I love you, thread; you're so much nicer to me than Stack Overflow.


CarForumPoster
Jun 26, 2013

⚡POWER⚡
I love this thread
