Python

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Thermopyle posted:

I came across this article when Googling for something unrelated and I found it semi-interesting.

It's about how Instagram converted from Python 2.7 to Python 3.5. Took them a year to do it.

It's mostly just generalities, but there is this:

This is in-line with my experience on a couple of projects I've converted to Python 3.

This is good to know.

Also, hi thread. I am learning python to do some web scraping and data manipulation and eventually machine learning stuff.

Holy crap is it easy. I got all the data from a webpage into a csv with like 4 lines of code (pandas) and theres 19 bajillion examples of how to do this online.

CarForumPoster fucked around with this message at 11:43 on Oct 6, 2017

# ¿ Oct 6, 2017 11:19

Adbot: ADBOT LOVES YOU

# ¿ Apr 28, 2024 20:29

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Hughmoris posted:

I'm working my way through Automate The Boring Stuff and am on the Web Scraping section. Just curious, with your pandas example, are you scraping full tables or are you using selectors to nab individual items and then building a dataframe?

I just started yesterday so right now I nab everything and dump it into a CSV.

Also, is there a machine learning thread? Im wondering if I could actually skip this step all together.

I have the problem of similar data like names, dates, events are stored in tables on lots of webpages in semi-different formats on each website. I want to scrape that data and have ~machine learning~(or whatever) sort it out for me such that it is stored in a way I can add to a database. (CSV or something like that) I have ~50 sample webpages I could format how I'd like (desired output) and use to train a model with and could easily get a few hundred if it was worth while.

I know I'd need a way to vectorize either the table info or the html all together...havent figured that out yet.

Machine learning thread?

CarForumPoster fucked around with this message at 17:29 on Oct 6, 2017

# ¿ Oct 6, 2017 17:25

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Cingulate posted:

You can try the data science thread, or the stats thread.

Though I'm not sure what you actually want. (Supervised) machine learning relates standardised data of one form (predictors) to standardised data of another form (outcomes). It seems to me you're describing a part of the data wrangling process still - although that too might belong into the data science thread.

Your understanding is correct. I want a webscraper more flexible than most (all?) web scrapers to scrape a certain type of data and put it in a database. (Dates and info usually stored in tables on many different webpage layouts)

I.e. I want a flexible data formatting tool. I want to use machine learning to do some data wrangling.

# ¿ Oct 6, 2017 21:33

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Dominoes posted:

Does anything in the OP look stale?

The Heroku link directs you to a login rather than info about heroku.

# ¿ Oct 8, 2017 14:05

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

FAGGY CLAUSE posted:

Not sure this is the exact thread for it, but I'm using Python to build out a prototype.

Long story short, I'm building out an document OCR process/pipeline to extract data from PDF documents. Many of these are just straight up scans of structured documents, hence the OCR bit. I'm using Tesseract for the moment and looking for suggestions on any other OCR solutions I can use. Cloud services are a no go. No Russian software companies either. Otherwise, the customer generally prefers buying commercial software in the end, but until then I just need to prove that this would be useful before we start buying things.

Anyone have OCR experience, particularly with structured forms, and recognizing data tables? I've been getting OK results for now. Tabula looks interesting. Tesseracts HOCR output format is a nice way to identify exact locations of each word. I can find fields by certain words/phrases that tend to get OCR'd the best and then locate text relative to these locations. For the tabular stuff I was considering even attempting some sort of clustering to see if that helped pull out wrapped text/phrases. I have some check boxes to pull out and have had some luck cropping with OpenCV and the counting the the % of black pixels vs white. But at this point I feel like I'm reinventing the wheel.

I don't have a lot of experience but also have this exact problem so I'm curious what you find out. Why no cloud services though?

# ¿ Dec 3, 2017 13:21

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

FAGGY CLAUSE posted:

Classified documents.

Are you this guy:
https://www.youtube.com/watch?v=h6TRYcx74qs

# ¿ Dec 3, 2017 14:47

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Sad Panda posted:

Very. I'll be logging in to each site and there's not going to be much standardisation.

There are updates every now and then but infrequently enough that once I get it all done at first I don't mind tweaking it when they change.

I do exactly this with a different, easier, non-python solution. For what industry and purpose are you doing this project?

# ¿ May 20, 2018 14:00

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Sad Panda posted:

It's personal use, no form of industry. I'm logging into betting websites. What's the easier option? I considered a Pyautogui setup which is just recording a series of clicks and using OCR + Image Recognition, but Selenium seemed like an interesting way to start.

I managed to get the login working on 145/155 websites. The other ones are throwing up Selenium errors. A few of them because elements are tagged as invisble so Selenium doesn't want to interact with them.

Some learning curve to this, decent documentation, and $150/month but many good features. It was much mroe robust and allowed faster deployment than trying anything with BeautifulSoup and the like.
https://www.parsehub.com/

EDIT: Its free to use with a limited feature set and what not but I ended up needing several of their not free features. E.G. I dont want my projects to become public.

CarForumPoster fucked around with this message at 18:36 on May 20, 2018

# ¿ May 20, 2018 18:28

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Short question: When you guys write python code, do you write a function under a def and then if __name__ = __main__ do the function you just wrote?

Reason:
I've been tinkering with python for a while now but am finally working on a little project (automated GUI/Web interaction using selenium or pyautogui to automate writing and uploading test reports) that needs me to write a bunch of little functions to call with a wrapper based on what I need to do that day. I wrote my code then when I was done, indented it and wrote def funcname(): followed by if __name__ = __main__ do the function. I feel like I should do this for everything I write. Do you guys do this this way?

# ¿ Jun 7, 2018 23:05

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

I have a dumb newbie question but I feel like I'm missing something obvious.

I have a function to make QRcodes:

code:

def genqr(url="defaulturl", fname="defaultfname"):
    img = qrcode.make(url)
    img.save(fname)
    print('Generating QR code from URL: ', url, " Saved to", fname)

I have some excel data to pass to it. My intuition is to load this in a dataframe with:
df = pd.read_excel("data.xlsx"), giving:

code:

                     urls filenames  number
0  http://url.com/?n=1001  1001.jpg    1001
1  http://url.com/?n=1002  1002.jpg    1002
2  http://url.com/?n=1003  1003.jpg    1003
3  http://url.com/?n=1004  1004.jpg    1004

I want to loop through the urls generating QR codes and saving them with those file names. I think I am missing something obvious. I thought df.apply(qrgen.genqr, axis=1, urls, filenames) or something like that might work.

What do I need to be doing?

# ¿ Jun 10, 2018 00:49

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

huhu posted:

I don't see a loop anywhere? I can't be sure because I've not used the functions you're discussing but I think you're assuming something is iterating when it's not.

You might want to do something like (phone posting):
code:
for row in data: 
  genQR(row)

I thought df.apply was supposed to act like a loop

EDIT:

Thanks for the help,. I got it!

code:

import qrgen as qr
import pandas as pd

df = pd.read_excel("cases.xlsx")

for index, row in df.iterrows():
    qr.genqr(row[0],row[1])

CarForumPoster fucked around with this message at 01:18 on Jun 10, 2018

# ¿ Jun 10, 2018 01:03

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

SurgicalOntologist posted:

You could do it with apply too, just need to set up the function so it takes a row as input.

Im trying to learn the whole "stop using for loops" thing, can you give me an example of it with df.apply?

# ¿ Jun 10, 2018 03:34

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

SurgicalOntologist posted:

No. In python you should never have to manage your iteration index. Maybe in a while loop or some other place where in some sense you're managing the loop logic yourself. But 99% of the time when I do that I realize there's a better way before I finish coding the loop.

Do you have anything I can read (or would you share) about why this is, particularly in the case of a while loop?

For example I wrote something that clicks a button, takes a screen shot and iterates the file name like: (I dont have the code handy) Is this bad?

code:

i=0

try:
    while True:
	i=i+1
        x = ____
	y = ____
        pyautogui.click(x, y)
	img = screenshot.grab(args)
	img.save('filename' + i + '.png')
except KeyboardInterrupt:
    print('\n')

# ¿ Jun 10, 2018 17:12

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

baka kaba posted:

I think CarForumPoster's example is a whole other thing, that's just an infinite loop that happens to contain an incrementing counter. It's not actually iterating over anything like a collection or a fixed range of values, and I think that's the simplest way to achieve it?

But it's probably better to have a range instead of an infinite loop yeah (you'll run into problems eventually)

Correct. I have to watch a zillion training videos for a new job and I am too lazy to click next and they had tests, but I appreciate the other perspectives and solutions its definitely a problem I encounter a lot where I need to do it over a limited number of items in a list or something.

# ¿ Jun 10, 2018 17:45

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

This was really helpful

# ¿ Jun 11, 2018 03:12

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

I didnt know that about PyCharm, thanks thread!

# ¿ Jun 19, 2018 02:59

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

If anyone ever wants a really simple GUI, particularly to make a form for taking arguments and running a script, big shout out to Gooey. Tutorial here:

http://pbpython.com/pandas-gui.html

I used this and a python to exe maker and made some little tools that helped my coworkers who would be scared of the command line.

# ¿ Jul 10, 2018 00:52

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

I would like to make a pandas dataframe out of a JSON file containing scraped court data. I'd like all the data to appear in columns but there's lots of nesting and I can't figure out how to get it in a nice big table like I'd like.

I thought the way to go would be something like:

code:

import pandas as pd
import json
from pandas.io.json import json_normalize

with open(r'data.json') as json_data:
    d = json.load(json_data)

data = json_normalize(d)
df = pd.DataFrame.from_dict(data, orient='columns')
df.head()

cases =  pd.read_json( (df['caselink']).to_json(), orient='index')
cases.head()

But that just gives me a 1 x 4000 df and doesnt unnest the data.

csvkit can give me a csv output which is easier but it requires a ton of reoganizing and clean up. I figured a custom solution might be in order.

Data sample: (Very long so I pastebinned)
https://pastebin.com/PhDkrCME

# ¿ Jul 14, 2018 16:15

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

cinci zoo sniper posted:

You won't be able to trivially map a decently hierarchical JSON document to a table or set of tables. I had to solve a similar problem at work and my final (naturally a temporary quick and dirt one to be rewritten better at the time) solution was to write a MongoDB script that does some server-side filtering and removes unnecessary fields and everything, a long-ish Python script with functions to read what i need and write it into pandas. On Python side I did iterate over individual JSON documents. The process was good enough to comb through a couple gigabytes of JSON poo poo reasonably quickly.

Edit: index orient is probably not what you may want in any case.

I dont mind some time investment in solving this issue. I will need to do it weekly or daily and the data will be formatted like this always. Whats the best way to solve it and got a link to learn more about it? I have no experience with MongoDB and dont control the server sending the data but am willing to learn.

CarForumPoster fucked around with this message at 16:58 on Jul 14, 2018

# ¿ Jul 14, 2018 16:55

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

cinci zoo sniper posted:

What's your desired output, to roughly sketch on that sample you linked? I've not seen that many good links on this to be honest, just random blog posts ways back. It's a technologically simple problem that can be labour-"intensive" if your source data and your desired output don't cooperate (like it often is for me since if there's one thing that people's credit histories are not then it's concise). I'll take a look at your file a bit later, on the go right now.

My goal is to use the data in a variety of machine learning, NLP and statistics tasks. CSVkit (the 70% option) may be a better option than I once conceived.

The goal would be to map the fields to a set of names that I can play with.

E.g. right now I take judge_name, one hot encode them, and look for correlations to outcomes.

code:

case_number	charge_type	date_filed	court_name	case_number_long	judge_name	case_status	citation_number	compliance_date	def_name	pltf_name	def_atty	atty_phone	dob	offense_date	charge&state	charge_name_1
abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123	abc123
abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124	abc124

EDIT: It is scraped it from the interwebs.

CarForumPoster fucked around with this message at 17:20 on Jul 14, 2018

# ¿ Jul 14, 2018 17:06

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

I love you thread and helpful people within.

# ¿ Jul 18, 2018 01:12

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

edmund745 posted:

Also while searching Google for "python try except", I got the Google code challenge! ,,,,,,,,,Or does everybody get these now?
I declined to take it.
They'd probably just hire me as a contractor, and gently caress that.
I aint takin no lousy 100K job where I gotta live in a 300K neighborhood, or face a 4-hour commute.

I have googled exactly that previously and didn't get it.

# ¿ Aug 27, 2018 21:37

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

edmund745 posted:

Maybe it's a lifetime average sort of thing? Most of what I do search for is digital electronics (part datasheets) and programming stuff.

I can see the interview now...
Google person: "Hello! We're so glad you could meet us today! How is your day going?"
me: blah-blah-blah-blah-blah-blah-blah-blah-blah-
Google person: "Well! You are certainly talkative for someone who is, uhhh..." (shuffles a three-inch-tall stack of papers) "...you're autistic, aren't you?"
me: "oh! , , , , , , , , -um. . . yea! . . ."

I've known multiple google engineers and I doubt this would exclude you.

# ¿ Aug 28, 2018 01:28

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

edmund745 posted:

Yes but I'm kinda white-trashy so I figured I'd be in the slummy part of Palo Alto. So maybe that's a 1-mil neighborhood?

You joke but there is a slummy part of Palo Alto called East Palo Alto. Its separated by 101 and its unmetaphorically the other side of the ~~tracks~~ highway. Probably gentrified by now though.

# ¿ Aug 28, 2018 21:53

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Thermopyle posted:

VS Code is a great editor, but it's not even the same league as PyCharm or IntelliJ as an IDE.

My company makes us use VS Express Desktop 2015 and it is completely elbows and dildoes compared to the free PyCharm I use on personal projects.

# ¿ Sep 6, 2018 22:08

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

CarForumPoster posted:

completely elbows and dildoes

PBS posted:

I have to know what this means.

Uncomfortably bad.

# ¿ Sep 9, 2018 01:04

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

cinci zoo sniper posted:

https://github.com/mahmoud/boltons Some cool stuff.

This, particularly remap, looks really useful.

# ¿ Oct 17, 2018 01:41

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Jose Cuervo posted:

Not sure if this is the right place to ask this, but I want to generate a time series which simulates consumption of a product. One way of doing this is to assume that the consumption has a particular distribution, say Triangular(lower=2, mode=4, upper=5), and then at each time step draw a random variate from that distribution to simulate the amount consumed during that time step. However, generating the time series in this way does not produce any correlation in consumption between successive time steps. That is, if the consumption of the product was on the higher end of the distribution at time t, then the consumption of the product at time t+1 should likely be on the higher end of the distribution as well, and vice versa.

How would I go about adding the correlation aspect to the simulated time series?

If you�re generating based on random variables (0,1) that follow a triangular CDF, you could make an if/then that looks at the random variable from t-1 and keeps generating numbers until the random number is within x distance/percentage/etc of the previous

# ¿ Oct 27, 2018 04:14

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

I have a dumb question, at some point in developing stuff does it become more natural to read and work with json? Like are there some of you who can deal with json as intuitively as you would data in a table/dataframe?

# ¿ Nov 22, 2018 09:57

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

I have an easy one thats frustrating me. I want to replace a bunch of strings inside df['col1'][i] with values from df['col2'][i]

I have a column in a dataframe df_maps called name with strings in it:
Bob
Dan
Sean

Another called param1
A
B
C

I have a column called 'html' with some html in it and stuff to replace ($name and $param1)
<h2>$name</h2><span style="color: #008000;"><strong>$param1</strong>

This executes:

code:

for i, row in df_maps.iterrows():
    df_maps['combined'] = df_maps['html']
    df_maps['combined'] = df_maps['combined'].str.replace('$name', df_maps['name'][i])
    print(df_maps['name'][i])

but the output of df_maps.combined[1] in jupyter is:
<h2>$name</h2><span style="color: #008000;"><strong>$param1</strong>
instead of
<h2>Dan</h2><span style="color: #008000;"><strong>$param1</strong>

# ¿ Nov 27, 2018 21:02

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

SurgicalOntologist posted:

You are re-creating the column 'combined' at every iteration of the loop, essentially resetting it. Try this:
Python code:
df_maps['combined'] = ''
for i, row in df_maps.iterrows():
    df_maps.at[i, 'combined'] = row.html.replace('$name', row.name)
or
Python code:
def insert_name(row):
    return row.html.replace('$name', row.name)


df_maps['combined'] = df_maps.apply(insert_name)
(you may need to play around with the kwargs to apply to get that to work)

I feel like an idiot and that worked.

# ¿ Nov 27, 2018 21:21

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

What I want to do: Split a dataframe column of name strings into first and last names using nameparser. Have the first and last names in their own columns.

Problem: I get an object with the original name which contains all of the name parts in a "name" object rather than a series that is appended as new columns.

Any suggestions on how to properly solve this problem?

code:

from nameparser import HumanName

def split_names(namestring):
    name = HumanName(namestring)
    firstname = name.first
    lastname = name.last
    return pd.Series([firstname,lastname])

code:


[code]
In:
names["firstname"] = testdf["Defendant"].apply(split_names)
names["firstname"][1]

Out:
<HumanName : [
	title: '' 
	first: 'LINDA' 
	middle: '' 
	last: 'HARRIS' 
	suffix: ''
	nickname: ''
]>

I thought this might be what unstack does but I get an error when trying to use it.

EDIT: This worked.

code:

components = ('title', 'first', 'middle', 'last', 'suffix', 'nickname')

def name_decomp(n):
    h_n = HumanName(n)
    return (getattr(h_n, comp) for comp in components)

rslts = list(zip(*testdf.Defendant.map(name_decomp)))

for i, comp in enumerate(components):
    testdf[comp] = rslts[i]

CarForumPoster fucked around with this message at 13:24 on Dec 4, 2018

# ¿ Dec 4, 2018 12:49

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

I got another easy/dumb one.

Have df["Time"] with

code:

2 days
NULL
NULL
3 yrs
NULL
2 wks
1 yrs1 day

I want to convert these values to days where NULL = 0. I do this once a week on about 2k rows.

code:

temp = df["Time"].str.split()
table = []

for i in range(len(temp)):
    if temp[i][1] == "days":
        temp[i][0] = int(temp[i][0]) 
    elif pd.isnull(temp[i]):
        temp[i] = [0,0]
    elif temp[i][1] == "yrs":
        temp[i][0] = int(temp[i][0])*365
    elif temp[i][1] == "mths":
        temp[i][0] = int(telp[i][0])*30
    elif temp[i][1] == "wks":
        temp[i][0] = int(temp[i][0])*7
    elif temp[i][1] == "yrs1" and temp[i][2] == "days":
        temp[i][0] = int(temp[i][0])*365 + 1
    else:
        temp[i][0] = "some other regex"
    table.append([temp[i][0]])
    print([temp[i]])

df["TimeDays"] = pd.Series(table)

This throws the following when it gets to NULL:
4 for i in range(len(temp)):
----> 5 if temp[i][1] == "days":
6 temp[i][0] = int(temp[i][0])
7 elif pd.isnull(temp[i]):

TypeError: 'float' object is not subscriptable

But I think the real issue is the "isnull" part.

I think its telling me hey there is no "ith" element to look up...but there is right?

# ¿ Dec 7, 2018 16:26

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

cinci zoo sniper posted:

It tries to get 2nd element inside NULL, that temp[i] corresponds to. Also I would suggest to use dateparser for this.

Good call. This works and should be more robust than my previous one:

code:

import dateparser

temp = df["Time"]
temp = temp.fillna("0 days")
temp = temp.str.replace("yrs","years ")
temp = temp.str.replace("mths","months ")
temp = temp.str.replace("wks","weeks ")
temp = temp.str.strip()
temp2 = temp.apply(lambda x: (datetime.today()-dateparser.parse(x)).days)

# ¿ Dec 7, 2018 18:08

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

TLDR: I get ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0') but I dont think I have an NA values.

I have code that wants to look up a string from df["StatCat"] in ArrestTable and return whether its a misdemenor, felony, other in the column "Lev". It will do eventually this for several other columns as well. Because the records arent always exacting in writing the statute number thats in StatCat, it needs to be a fuzzy logic match.

code:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def find_law(row):
    abbrev = process.extractOne(df_raw["StatCat"][row],choices=ArrestTable["StatuteCat"],score_cutoff=80)
    if abbrev:
        #print(df_raw["StatCat"][row])
        #print(abbrev[0])
        return ArrestTable[ArrestTable['StatuteCat'] == abbrev[0]]["Lev"].item()
    return np.nan

I do some regex/string manipulation to get them in a common format to lookup.

code:

df_raw[['StatuteNum','StatSub']] = df_raw["Statute"].str.split('\(|\)', expand=True, n=1).iloc[:,[0,1]]
df_raw["StatSub"] = df_raw["StatSub"].str.replace("(","")
df_raw["StatSub"] = df_raw["StatSub"].str.replace(")","")
df_raw["StatSub"] = df_raw["StatSub"].str.lower()
df_raw["StatCat"] = df_raw["StatuteNum"].map(str) + "&" + df_raw["StatSub"]

I tried to fillna so that the wouldnt be any na values...

code:

df_raw["StatSub"] = df_raw["StatSub"].fillna("")
df_raw["StatuteNum"] = df_raw["StatuteNum"].fillna("")
df_raw["StatCat"] = df_raw["StatCat"].fillna("")

#df_raw["StatLev"] = df_raw["StatCat"].map(ArrestTable.set_index('StatuteCat')["Lev"]) #This doesnt work because due to minor text differences in the staute number, it returns NA ~ 25% of the time.
df_raw["StatLev"] = df_raw.apply(find_law, axis=1)

df_raw[180:190].T

I get a ValueError:

code:


ValueError: ('cannot index with vector containing NA / NaN values', 'occurred at index 0')

---> 12 df_raw["StatLev"] = df_raw.apply(find_law, axis=1)
[...]
      4 def find_law(row):
----> 5     abbrev = process.extractOne(df_raw["StatCat"][row],choices=ArrestTable["StatuteCat"],score_cutoff=80)

EDIT: This works fine, though is slow:

code:

for i in range(20):
    tempvar = find_law(i)
    print(tempvar)

CarForumPoster fucked around with this message at 15:34 on Dec 12, 2018

# ¿ Dec 12, 2018 14:42

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

SurgicalOntologist posted:

Instead of df_raw["StatCat"][row], try row["StatCat"] inside the function you're applying. The point of apply is that the looping is done internally so you shouldn't have to index back into the dataframe. The row that is passed to apply is not the row index, it's the actual row data as a series.

cinci zoo sniper posted:

Yeah I was wondering if there is some indexing magic going on that can actually work like in that example or not.

This was the issue. Thank you both.

The new issue is that the fuzzy logic function is hilariously slow. I need to run it 1M rows. It takes 2m 30s to run on 1000 rows.

I'm going to try to figure out something with .map and str.contains or some other solution that ends up being "good enough"

# ¿ Dec 13, 2018 22:16

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

SurgicalOntologist posted:

Do you have python-levenshtein installed? IIRC it's not a requirement of fuzzywuzzy but if it's available it will run faster. Other than that I would just bite the bullet and run it over the weekend, assuming you only need to run it once.

Yes, still slow as balls and I'm doing EDA with a large, new data set so I tend to run it frequently.

I made some improvements to the lookup table and now am up to 20% NaNs instead of 40% NaNs and it runs lightning fast, but still misses anything that's not an exact match.

code:

df_raw["StatLev"] = df_raw["StatCat"].map(ArrestTable.set_index('StatuteCat')["Lev"])

e.g. the string I have is 456.023&1
the lookup table has
456.023&1a
456.023&1b

But I am fine with returning the first one that it matches with, as its accurate enough, so fuzzywuzzy was returning the result after matching on 456.023&1a

CarForumPoster fucked around with this message at 22:54 on Dec 13, 2018

# ¿ Dec 13, 2018 22:49

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

How do I do something to a range of columns in a df?

I have 10 columns named "DispoClass_#" with the # being 1-10. I want to set to ordinality of the categorical values they contain with .cat.set_categories

How do I select all 10? I need to do this with other things structured as "Name_#" so just writing them out isn't that deirable.

Something like this, but, ya know, works...

code:

df_raw[["DispoClass_1":"DispoClass_10"]].cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)

# ¿ Dec 17, 2018 16:16

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

cinci zoo sniper posted:

You should use df.filter() for that.

Python code:

import pandas as pd

test = pd.DataFrame()
test["hello"] = [1, 2, 3]
test["a_01"] = ["foo", "bar", "baz"]
test["a_02"] = ["foo", "bar", "baz"]
test["a_03"] = ["foo", "bar", "baz"]

print(test)

target = test.filter(like='a_', axis=1)
test[target.columns] = target.apply(lambda x: x.str.capitalize())

print(test)

Much appreciated. Also appreciate you helping me last week.

Also I just now found out about : https://regex101.com/

Holy crap is that helpful! I have the worst time with regexs and I basically end up finding someone on stack overflow who wasnt the same thing and cpying the answer.

# ¿ Dec 17, 2018 19:25

Adbot: ADBOT LOVES YOU

# ¿ Apr 28, 2024 20:29

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

priznat posted:

Anyone know of a good project that can webscrape for pricing information? I was thinking beautifulsoup mostly. Just have a few pages from vendors I setup watches on (like camelcamelcamel but cross vendor)

I�m sure this has probably been done already a million times so if I could just sponge off one and check out the code that�d be super.

ParseHub is a non programming (its firefoxes element picker for scraping best I can tell) solution that I found pretty easy to use after a few minutes of their tutorial videos. Has some limitations but if you want very quick and dirty, I like it and its free for up to 200 pages at once. You can load like search terms or whatever straight from a csv or json and loop over them which is nice.

# ¿ Jan 18, 2019 14:24

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

«‹›9 »