Ben Nerevarine
Apr 14, 2006
I have no idea where to ask this, so apologies in advance if this is completely not the place.

I've built a tool in Python to scrape a certain wiki for a certain game. The goal here is to bring a bunch of pages down locally and transform the HTML to Markdown files to be read in with Obsidian. Obsidian has this incredibly cool graph view that shows pages and their relationships visually, which is really the whole point of this project, to see sort of a bird's eye view of everything and maybe make mental connections between things that aren't immediately obvious.

I'm scraping about 3000 pages, bringing down one smallish image per page, totaling about 130MB. I'm caching images so I'm not re-downloading them every time I scrape a page for text updates.

I'd like to make this tool publicly available, to give content creators (or anybody really) a starting point for their own note-taking and theorycrafting for said game. Starting with an Obsidian Vault containing all enemies/locations/NPCs, etc., already populated and linked would be helpful to people, I think (it has been for me, anyway).

But I'm concerned about the legality of, and my own liability in, releasing such a tool. This wiki is owned and operated by a Chinese company that runs a whole host of game-specific wikis. Their Terms of Service prohibit scraping without express written consent. From my own research it looks like web scraping is not illegal, but I could be liable for breach of contract under their terms of service.

It's one thing for me to be scraping their website a couple times a day while I've been developing this thing, but it's another to release this tool into the wild and have more people hitting their servers for 130MB's worth of bandwidth a pop. I'm not making any assumptions about how (un)popular this tool would be, it's pretty niche, but I'm concerned all the same.

One idea I had was to omit any explicit reference to the wiki, and have the first line of a config file be WIKI_URL= with the value blank, and give a little wink wink nudge nudge in the readme for the user to fill in the blank with their favorite wiki for this game. Still, seems dicey.

Am I opening up a can of worms here that should be left unopened?

dougdrums
Feb 25, 2005
There might be a few things you can do, and I guess it depends on how it’s licensed too.

Making a HEAD request and checking the Last-Modified header will let you know if your copy is stale, provided the header is accurate.
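
Something like this, as a rough sketch (untested; assumes the requests library, and is_stale, the user-agent string, and the saved timestamp are made up for illustration):

Python code:
import requests

def is_stale(url, last_fetched):
    """True if the server says the page changed since we last saved it.

    last_fetched is the HTTP-date string saved from a previous response,
    e.g. "Wed, 30 Nov 2022 04:10:00 GMT".
    """
    resp = requests.head(url, headers={"User-Agent": "wiki-scraper/0.1"}, timeout=10)
    modified = resp.headers.get("Last-Modified")
    if modified is None:
        return True  # no header, so assume stale and re-fetch
    return modified != last_fetched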

There might also be a page like https://en.m.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2 where you can collect only the pages changed since the last run.
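
If the site exposes the standard MediaWiki API, a sketch like this could pull only the titles changed since the last run (the endpoint and the timestamp are placeholders):

Python code:
import requests

params = {
    "action": "query",
    "list": "recentchanges",
    "rcend": "2022-11-23T00:00:00Z",  # oldest change to include, i.e. the last run
    "rclimit": 500,
    "format": "json",
}
resp = requests.get("https://example.com/w/api.php", params=params, timeout=10)
changed_titles = [rc["title"] for rc in resp.json()["query"]["recentchanges"]]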

They might have the Semantic MediaWiki extension https://en.m.wikipedia.org/wiki/Semantic_MediaWiki which is used by Liquipedia, but I never figured out how to use it.

I believe if they use a Creative Commons license, you can rehost your compiled copy of it without anyone else needing to run it.

e: also I’m pretty sure the worst they can do is ban you, but you probably want to avoid that still. If you space it out enough there’s no way they’ll care about it; it’s likely background noise even if others start running it. It’s still a good idea to make it as efficient as you can though, because that’s good for them, other users, and also you.
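
Even a dumb fixed delay between requests goes a long way (sketch only; the delay value and the page_urls list are made up):

Python code:
import time

import requests

DELAY_SECONDS = 2  # arbitrary politeness delay, tune to taste
page_urls = ["https://example.com/wiki/SomePage"]  # stand-in for the scraper's list

for url in page_urls:
    resp = requests.get(url, headers={"User-Agent": "wiki-scraper/0.1"}, timeout=10)
    # ... parse the HTML and convert to Markdown here ...
    time.sleep(DELAY_SECONDS)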

dougdrums fucked around with this message at 04:10 on Nov 30, 2022

QuarkJets
Sep 8, 2008

I don't think there are any issues with writing the code or making it publicly available under an open source license. Make the URL an input argument and use a config file to store the last-used input as the default for the next invocation.
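
A rough sketch of what I mean, using stdlib argparse and configparser (the file name, section, and key are invented for illustration):

Python code:
import argparse
import configparser
from pathlib import Path

CONFIG_PATH = Path("scraper.ini")

def get_wiki_url():
    config = configparser.ConfigParser()
    config.read(CONFIG_PATH)
    default = config.get("scraper", "wiki_url", fallback=None)

    parser = argparse.ArgumentParser()
    parser.add_argument("--wiki-url", default=default, required=default is None,
                        help="Base URL of the wiki to scrape")
    args = parser.parse_args()

    # persist the last-used URL as the default for next time
    config["scraper"] = {"wiki_url": args.wiki_url}
    with CONFIG_PATH.open("w") as f:
        config.write(f)
    return args.wiki_url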

I am not a lawyer, this is my opinion and not legal advice

CarForumPoster
Jun 26, 2013

Hey, our attorneys handled this back in 2019. I am not a lawyer and this is not legal advice. Generally, scraping websites as though you’re a human is permissible. Whether there’s civil liability depends on a lot of things, potentially including whether your actions cause financial damage to the company. Whether the ToS is binding is an “it depends” as well.

Forming an LLC can cost under $200 and may shield you personally from those worries. If they sue, they’re suing an LLC with no money.

CarForumPoster fucked around with this message at 05:27 on Nov 30, 2022

dougdrums
Feb 25, 2005
Fwiw, I always put an email address in the user agent, and I’ve only been asked to stop once out of all the times I’ve made things like this. I wrote a client for PredictIt’s “undocumented” API some years ago. I never released it or used it to make more calls than the site itself, so I take it they just didn’t want people using automated tools.
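
Concretely, something like this (the tool name and address are placeholders):

Python code:
import requests

url = "https://example.com/wiki/SomePage"  # placeholder
headers = {"User-Agent": "wiki-scraper/0.1 (contact: you@example.com)"}
resp = requests.get(url, headers=headers, timeout=10)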

duck monster
Dec 15, 2004

Is it just me or is Pipenv fragile as gently caress?

Although in my curmudgeonly way I had to be dragged into it kicking and screaming, once I stopped and learned it, I love the whole idea. It's a pretty good way of specifying a repeatable Python environment for dev and deploy.

But good loving grief is it temperamental. I'll have something that works great in dev, and then when deployed it just explodes in a ball of Python errors.

qsvui
Aug 23, 2003
I thought everyone had moved on to Poetry by now? Not that I know if it's any better...

Ben Nerevarine
Apr 14, 2006
Thanks for the responses above! Some very good ideas here on saving bandwidth that I hadn't considered.

CarForumPoster posted:

Forming an LLC can cost under $200 and may shield you personally from those worries. If they sue, they’re suing an LLC with no money.

I'm thinking this might not be a bad thing to do regardless; it's something I've been meaning to do at some point anyway, in case I ever want to try making some money off of these dumb little projects of mine.

Precambrian Video Games
Aug 19, 2002



Am I wrong for wanting to implement immutable objects as frozen dataclasses with cached properties? For example:

Python code:
from dataclasses import dataclass
from functools import cached_property

@dataclass(frozen=True)
class Product:
	x: int = 1
	y: int = 2

	@cached_property
	def xy(self):
		return self.x * self.y
I suppose the most meaningful difference between cached_property and computing a bunch of derived quantities in __init__ (or __post_init__ for a frozen dataclass) is that evaluation is deferred until Product.xy is first accessed, which may not be desirable, but the bonus is that properties that aren't read are never computed in the first place. Either way, this looks way better to me than either flavour of namedtuple.

(and yes, I know nothing is truly immutable as you can call object.__setattr__ if you really want to, I'm just trying to avoid having single-underscore attributes or methods)
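
For contrast, here's a sketch of the eager __post_init__ flavour (EagerProduct is invented here); note that it has to use that same object.__setattr__ trick, and that cached_property needs the instance __dict__, so it breaks if you also pass slots=True:

Python code:
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EagerProduct:
    x: int = 1
    y: int = 2
    xy: int = field(init=False)

    def __post_init__(self):
        # frozen=True blocks normal assignment, so smuggle the value in
        object.__setattr__(self, "xy", self.x * self.y)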

QuarkJets
Sep 8, 2008

I think you're correct.

Zugzwang
Jan 2, 2005

I did not know about cached_property. That looks awesome and I can’t wait to simplify some of my code with it.

Falcon2001
Oct 10, 2004


Zugzwang posted:

I did not know about cached_property. That looks awesome and I can’t wait to simplify some of my code with it.

Oh yeah that is rad.

mila kunis
Jun 10, 2011
So since the OP is pretty old: what's considered the best tutorial/course these days for learning Python for someone who's never programmed before?

Falcon2001
Oct 10, 2004


mila kunis posted:

So since the OP is pretty old: what's considered the best tutorial/course these days for learning Python for someone who's never programmed before?

I personally recommend https://automatetheboringstuff.com/ - it's available free online but he also has a course on Udemy and a book. There's no magic bullet though, so it might not work for you, but for me it was the thing that finally made software dev click into place.

Zugzwang
Jan 2, 2005

+1 for ATBS. It’s not just a good intro to the language. It immediately shows you how to do truly useful things with it. (It’s also free!)

Al’s follow-up book to that, Beyond the Basic Stuff with Python, is also great. It’s about how to put together nontrivial, maintainable programs once you know the ins and outs of the language.

Sad Panda
Sep 22, 2004

Replit.com has a 100 Days of Python course which talks you through many key concepts and applies them in scenarios. It's free and has videos explaining it.

duck monster
Dec 15, 2004

qsvui posted:

I thought everyone moved onto Poetry now? Not that I know if it's any better...

Oh god. Why are people creating even more package managers? Are people REALLY pining for the toolchain hell that is JavaScript THAT much?

pip works!

(Oh, and I suppose there's Conda. The maths people seem to like that one for some reason.)

dougdrums
Feb 25, 2005
I think I have like 4 different python environment managers on my computer.

dougdrums
Feb 25, 2005
One day there’s gonna be a docker image for every possible python package and I’ll have 4 different tools on my computer to manage that.

Zugzwang
Jan 2, 2005

I just install everything into base and use only the packages that win the ensuing melee.

Foxfire_
Nov 8, 2010

pip doesn't actually work very well, since it's not good at distributing the compiled binaries that go along with Python bindings.

QuarkJets
Sep 8, 2008

duck monster posted:

Oh god. Why are people creating even more package managers? Are people REALLY pining for the toolchain hell that is JavaScript THAT much?

pip works!

(Oh, and I suppose there's Conda. The maths people seem to like that one for some reason.)

pip works well for pure Python packages; conda works well if you also have non-Python dependencies. The Intel MKL is a faster backend for numpy, sometimes much faster, and while there are instructions for compiling numpy from source against the MKL (which you of course installed separately), gently caress actually doing that, and quadruple-gently caress doing that on Windows lmao

duck monster
Dec 15, 2004

QuarkJets posted:

pip works well for pure Python packages; conda works well if you also have non-Python dependencies. The Intel MKL is a faster backend for numpy, sometimes much faster, and while there are instructions for compiling numpy from source against the MKL (which you of course installed separately), gently caress actually doing that, and quadruple-gently caress doing that on Windows lmao

Yeah, I get that with Conda. All those hosed up nvidia and intel binary blobs that need to be rustled. Or f*#?king GDAL lol.

I just wish Pipenv was fit for purpose. I like the idea. It's just... kind of broken.

CarForumPoster
Jun 26, 2013


duck monster posted:

Yeah, I get that with Conda. All those hosed up nvidia and intel binary blobs that need to be rustled. Or f*#?king GDAL lol.

I just wish Pipenv was fit for purpose. I like the idea. It's just... kind of broken.

Yea, sometimes there’s a conda install x that works much better than the similar pip install x when you need binaries. When you end up deploying a docker image anyway, it can be a big help.

IIRC at various times this has been the only way to successfully install at least one of these on at least one OS…

Selenium
FastAI
scikit-learn
PyTorch

Ihmemies
Oct 6, 2012

I guess numpy is just drat slow. In Advent of Code 2022/08 I saved data to a numpy 2d array. When I iterated through the array, two Python for loops were like twice as fast as using the numpy.ndarray iterator. It made me wonder whether other numpy functions are this slow too.

I timed the parts of my program:
time to do slices: 4.49991ms
time for silver: 63.13467ms
time for gold: 51.50747ms
time total: 119.14206ms

For silver I was doing this around 40000 times:
code:
max_trees_around.append(numpy.max( lf_slice )) 
So I switched to
code:
max_trees_around.append(max(lf_slice.flatten().tolist())) 
time to do slices: 5.00917ms
time for silver: 39.77084ms
time for gold: 40.23886ms
time total: 85.01887ms
...

I even tried running the whole solution 100 times with timeit.
With python's max(): 9040.51970ms
With numpy.max(): 12312.64960ms

Edit: I got the same with numpy.amax(lf_slice); numpy.nanmax() was quite a bit slower.
Edit2: I guess numpy needs a lot more data than a 10k-cell 2d array. Welp. At least it's so easy to access slices compared to Python's 2d arrays built from lists.

Ihmemies fucked around with this message at 18:25 on Dec 8, 2022

Seventh Arrow
Jan 26, 2005

Apologies for the bunches of text incoming. This is my first foray into pandas' "to_sql" function.

I have a dataframe full of metadata pulled from Salesforce. I'm trying to push this dataframe to MySQL (my manager wants to use MySQL because it's free, of course).

The dataframe looks ok when I inspect it. The to_sql connection string works ok. But I get the following error:

quote:

ProgrammingError: (pymysql.err.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 0, 0, 0, NULL, (), NULL, NULL, 0, 0, 0, 0, 'tns:ID', 1, 'id', 0, 0, 0),(1, 0,' at line 1")

The full error is a big ol' text dump, if you're morbidly curious or think it will help: https://www.klgrth.io/paste/8795j

The Python code is below. The Salesforce credentials are from a randomly generated instance populated with fake data, so you're free to try the code below, pull it into a dataframe, or whatever:

code:
import pandas as pd
from simple_salesforce import Salesforce, SalesforceLogin, SFType
import sqlalchemy
import pymysql

engine = sqlalchemy.create_engine("mysql+pymysql://<username>:<password>@localhost/python_data_loader")


class SalesforceMetadata:
    def __init__(self, username, password, security_token=None, domain=None):
        self.username = username
        self.password = password
        self.security_token = security_token
        self.domain = domain 
        self.salesforce = None

    def connect_to_salesforce(self):
        try:
            self.session_id, self.instance = SalesforceLogin(username=self.username, password=self.password, security_token=self.security_token, domain=self.domain)
            self.salesforce = Salesforce(session_id=self.session_id, instance=self.instance)
            print(f'Connected to "{self.instance}"')
        except Exception as e:
            print(e)

    def salesforce_org_metadata_list(self):
        """
        Return Salesforce org objects metadata as a dataframe
        """
        if self.salesforce is None:
            print('Please connect to Salesforce first')
            return
        df = pd.DataFrame(self.salesforce.describe())
        df = df['sobjects'].apply(pd.Series)
        return df

    def object_metadata(self, object_api_name):
        if self.salesforce is None:
            print('Please connect to Salesforce first')
            return
        try:
            obj = SFType(object_api_name, session_id=self.session_id, sf_instance=self.instance)
            df = pd.DataFrame(obj.describe()['fields'])
            return df
        except Exception as e:
            print(e)
                
    def extract_picklist_values(self, object_metadata_df, field_api_name):
        picklist_values = []
        picklist_list_column = (object_metadata_df[object_metadata_df['name']==field_api_name]['picklistValues'].apply(pd.Series).T)
        if picklist_list_column.empty:
            print('Picklist is empty.')
            return
        for item in (object_metadata_df[object_metadata_df['name']==field_api_name]['picklistValues'].apply(pd.Series).T).iterrows():
            picklist_values.append(item[1].values[-1]['value'])
        return picklist_values

    
username = 'test-psh23q4xjkbt@example.com'
password = 'bfkuG5wf]snic'
domain = 'test'
security_token = None


# construct sf_metadata object
sf_metadata = SalesforceMetadata(username, password, security_token, domain)


# connect to Salesforce
sf_metadata.connect_to_salesforce()


# return organization metadata (this will return overall detail of every single Saleforce object in an org)
df_org_metadata = sf_metadata.salesforce_org_metadata_list()
df_org_metadata.to_csv('org metadata sample.csv', index=False)
# print(df_org_metadata)


# to return just an object's metadata as a dataframe
df_account = sf_metadata.object_metadata('account')
df_opportunity = sf_metadata.object_metadata('opportunity')


# push data to MySQL db
object_list = ['Account']
for object_api_name in object_list:
    df = sf_metadata.object_metadata(object_api_name)
    if df is None:
        print(f'Object {object_api_name} is not found.')
    else:
        df.to_sql(con=engine,name=object_api_name,if_exists='replace', index=True)
Thanks in advance for any help. I kinda-sorta suspect that somehow a funky character or spacing or something is being passed to MySQL, but I'm not exactly sure.

Also, I tried other MySQL modules (like MySQLdb) to no avail.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Seventh Arrow posted:

Apologies for the bunches of text incoming. This is my first foray into pandas' "to_sql" function.

(...)

I didn’t read it all, but IIRC a literal % in the text screws things up when the driver tries to do %s substitution. IIRC a workaround is to use the text() function. I’m posting from bed so can’t Google much more than that.

QuarkJets
Sep 8, 2008

I don't really know the answer, but since you're just using a canned dataframe method I'd guess that there's some issue in your data. I'm not about to go digging through that huge one-line dump of it so good luck with that

You can help isolate the problem by just inserting one row at a time, and if it breaks on the very first row then try selecting a specific range of columns (say the first 10 of them).
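
A sketch of that bisection idea (df and engine are the objects from the post above; the table name is invented):

Python code:
# insert one row at a time until something breaks
for i in range(len(df)):
    try:
        df.iloc[[i]].to_sql(con=engine, name="account_debug",
                            if_exists="append", index=True)
    except Exception as e:
        print(f"row {i} failed: {e}")
        break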

QuarkJets
Sep 8, 2008

CarForumPoster posted:

I didn’t read it all, but IIRC a literal % in the text screws things up when the driver tries to do %s substitution. IIRC a workaround is to use the text() function. I’m posting from bed so can’t Google much more than that.

FYI OP this is a red herring, it'd only matter if you were doing your own insertions with pymysql statements instead of passing a connector to the dataframe (which is also something you can try)

DoctorTristan
Mar 11, 2006

Also, the ‘connectionless commits’ in SQLAlchemy (your df.to_sql(con=engine,…)) are deprecated; you should do this as

code:
with engine.connect() as conn:
    df.to_sql(con=conn,…)
Edit: from a very brief scan of the error message I’d venture that the issue is going to turn out to be that the source data contains some text in brackets that pandas is either parsing into a tuple or not quoting properly, and it's loving up the SQL insert statements. The way to figure it out is to iteratively narrow down the problem row/column, as QuarkJets suggested above.

DoctorTristan fucked around with this message at 07:07 on Dec 9, 2022

DELETE CASCADE
Oct 25, 2017

numpy sucks, pandas sucks, gotta use it but it'll break so who tf knows

don't use mysql ever, replace with postgres, everything will work fine then

QuarkJets
Sep 8, 2008

oh yeah and point out to your manager that postgres is also free

QuarkJets
Sep 8, 2008

Ihmemies posted:

I guess numpy is just drat slow. In Advent of Code 2022/08 I saved data to a numpy 2d array. When I iterated through the array, two Python for loops were like twice as fast as using the numpy.ndarray iterator. It made me wonder whether other numpy functions are this slow too.

(...)

The biggest problem with numpy is that it is very eager to create temporary arrays. Every time you think you're doing some operation in-place, numpy is probably creating a temporary array behind the scenes to store intermediate results. So while numpy is great for writing code quickly, when you need speed it's better to write your own vectorized functions with numba.

But in this case it sounds like you're spending a lot of time in a Python for loop, which is really the performance bottleneck here. Definitely push that into a numba kernel.
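
For illustration, the kind of kernel I mean (assumes numba is installed; row_running_max is an invented example, not your actual AoC solution):

Python code:
import numba
import numpy as np

@numba.njit
def row_running_max(grid):
    # running max along each row in one compiled pass, instead of
    # slicing and calling numpy.max tens of thousands of times
    out = np.empty_like(grid)
    for i in range(grid.shape[0]):
        best = grid[i, 0]
        for j in range(grid.shape[1]):
            if grid[i, j] > best:
                best = grid[i, j]
            out[i, j] = best
    return out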

Seventh Arrow
Jan 26, 2005

QuarkJets posted:

I don't really know the answer, but since you're just using a canned dataframe method I'd guess that there's some issue in your data. I'm not about to go digging through that huge one-line dump of it so good luck with that

You can help isolate the problem by just inserting one row at a time, and if it breaks on the very first row then try selecting a specific range of columns (say the first 10 of them).

This would be time-consuming, but it's definitely a good idea. I think there are ways to filter out rows in pandas directly, so I will look into that - thanks!

But I highly suspect that it has something to do with what's said below:


DoctorTristan posted:

Also, the ‘connectionless commits’ in SQLAlchemy (your df.to_sql(con=engine,…)) are deprecated; you should do this as

(...)

Edit: from a very brief scan of the error message I’d venture that the issue is going to turn out to be that the source data contains some text in brackets that pandas is either parsing into a tuple or not quoting properly, and it's loving up the SQL insert statements. The way to figure it out is to iteratively narrow down the problem row/column, as QuarkJets suggested above.

Yes, I think you're right. To give a bit of background, there are some columns where the (meta)data is supposed to represent a multi-choice field. These have square and round brackets and are separated by commas, and from what little I know about database design, it's a disaster waiting to happen. Maybe the best option is to filter out all of the special characters and just separate the values with spaces. Or maybe push them to a separate table.
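
Another option I'm toying with, assuming the offending column is the picklistValues field from the describe() call: serialize it to a JSON string before the insert, which keeps the brackets out of the generated SQL while preserving the data:

Python code:
import json

# df is the object-metadata dataframe from my earlier post
df["picklistValues"] = df["picklistValues"].apply(json.dumps)

A stored procedure could parse the JSON later, or it could be split out into a child table.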


DELETE CASCADE posted:

numpy sucks, pandas sucks, gotta use it but it'll break so who tf knows

don't use mysql ever, replace with postgres, everything will work fine then


QuarkJets posted:

oh yeah and point out to your manager that postgres is also free

noted, I will test it out and see.

Thanks everyone!

DoctorTristan
Mar 11, 2006

Also as a very general rule I’d avoid using pandas for data pipeline work like this because the ‘helpful’ inferring of column data types will frequently gently caress you over in exactly this way. Generally the safer way is to upload the data as text into a staging table and then do type conversions on the database side.

(The best method is to use proper ELT tools, but from the sound of things that isn’t an option)
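
A minimal sketch of the staging idea (the table name is invented; it reuses the engine.connect() pattern from above):

Python code:
# cast every column to str so pandas can't mis-infer types, then let
# the database do the type conversions
staged = df.astype(str)
with engine.connect() as conn:
    staged.to_sql(con=conn, name="account_staging", if_exists="replace", index=False)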

QuarkJets
Sep 8, 2008

OP I think your observations here are the payoff from the big recommendations a few weeks ago: that you should carefully inspect the data so that you understand what it contains. After a few posters suggested that this is a data problem you were able to point out that there's a multiple-choice field that contains bracketed options, and you were able to deduce that maybe that's the real cause of the issue. I think that's probably right, since the error is complaining about parens in an SQL statement that pandas created from your data.

So what should you do about this? Do you really need this column in the database? It sounds like these are options shown to a user in a UI, maybe; that doesn't sound useful for a database, but I have no idea whether you actually need it. Drop that column if you can. Ideally you'd only query for the smallest necessary number of columns; right now you have 57 columns, and I'm guessing you don't need to put all of those in a database table, but maybe I'm wrong.

DoctorTristan is right to point out that there are pitfalls in using pandas. If there's more info you can extract from the Salesforce API then you may be able to start catching these issues before they occur - deviations in data types are especially troublesome (e.g. maybe there's a field that is normally an integer but sometimes returns a string or null or some other poo poo).

Seventh Arrow
Jan 26, 2005

DoctorTristan posted:

Also as a very general rule I’d avoid using pandas for data pipeline work like this because the ‘helpful’ inferring of column data types will frequently gently caress you over in exactly this way. Generally the safer way is to upload the data as text into a staging table and then do type conversions on the database side.

(The best method is to use proper ELT tools, but from the sound of things that isn’t an option)

Thanks for the advice - and yes, I think a lot of this boils down to the fact that I'm having to do a lot of stuff as a data engineer that I haven't done before. It might be possible to look into different ETL solutions, but they're not pushing a lot of data, so something like Hadoop seems like overkill. I'll see how things go, however! I'm now wondering whether pyspark has the same pitfalls as pandas.


QuarkJets posted:

OP I think your observations here are the payoff from the big recommendations a few weeks ago: that you should carefully inspect the data so that you understand what it contains. After a few posters suggested that this is a data problem you were able to point out that there's a multiple-choice field that contains bracketed options, and you were able to deduce that maybe that's the real cause of the issue. I think that's probably right, since the error is complaining about parens in an SQL statement that pandas created from your data.

So what should you do about this? Do you really need this column in the database? It sounds like these are options shown to a user in a UI, maybe; that doesn't sound useful for a database, but I have no idea whether you actually need it. Drop that column if you can. Ideally you'd only query for the smallest necessary number of columns; right now you have 57 columns, and I'm guessing you don't need to put all of those in a database table, but maybe I'm wrong.

DoctorTristan is right to point out that there are pitfalls in using pandas. If there's more info you can extract from the Salesforce API then you may be able to start catching these issues before they occur - deviations in data types are especially troublesome (e.g. maybe there's a field that is normally an integer but sometimes returns a string or null or some other poo poo).

Thanks also, I'm learning what I can. TBH, even if my manager hadn't suggested using Python for this pipeline, I would have suggested it anyway because it's familiar to me and it seems like it can "do everything." But now that I'm learning about the weaknesses of pandas, maybe not!

The columns are needed. The idea is that they want an always-up-to-date set of metadata in a database. There are stored procedures (which I will need to write!) that will check the client's data against this metadata (i.e., does it exceed the allowed number of characters, are the country codes correct, etc.), and checking for the right "picklist" values is part of that. I think it will still work if all the values are just separated by spaces with no brackets.

Ultimately it's a welcome challenge though... in my last position I didn't get to use programming at all, so I'm glad that I'm at least getting to use this stuff in a practical way.

SporkOfTruth
Sep 1, 2006

I've got a (very possibly stupid) question on generators prompted by some discussion in the Coding Horrors thread: is there a way to conditionally change the order of yield statements in a loop?

The inspiring post:

QuarkJets posted:

(...)
The loop control variable isn't hidden, it's part of the Sequence that was created when you invoked range(). Generators work this way too:

Python code:
def weird_number_generator():
    yield 0
    yield 10
    yield -40
    yield 50

values = weird_number_generator()
x = next(values)
print(x)  # 0
x += 4
print(x)  # 4
x = next(values)
print(x)  # 10
print(next(values))  # -40

The issue at hand is this: a colleague of mine has been getting familiar with a domain-specific package I work on as part of a bigger team.
This package basically has Agents producing data from a number of Sources per agent at discrete time steps given some Inputs -- something like this:
Python code:
# inputs is a list of Input objects -- a time and an input
# agents is a list of Agent objects that each have a number of Source objects in the property "sources"
# Sources have a method `do_thing` that acts on an Input 

def thing_generator(inputs, agents):
    for time, input in inputs:
        for agent in agents:
            for source in agent.sources:
                outputs = source.do_thing(input)
                yield time, outputs
Obviously, calling `next` on `thing_generator` follows the loop nesting from the inside out: all of the sources for one agent are exhausted before you move to the next agent, and all of the agents must be completed before we get to the next time step/input pair. The way it's used in an example in the docs is something like:
code:
import itertools

g = thing_generator(mock_inputs, mock_agents)

# Generators can't be deep-copied, so tee off an independent
# iterator just to show the results for plotting
g, g_copy = itertools.tee(g)

for t in mock_inputs.times():
   agent1_source1_outputs.append(next(g_copy)[1])
   agent1_source2_outputs.append(next(g_copy)[1])
  
   agent2_source1_outputs.append(next(g_copy)[1])
   agent2_source2_outputs.append(next(g_copy)[1])

 # Some plotting poo poo is done with the lists made above
My colleague hates this. Deeply. He can't understand why next would ever be used in our context outside of "next timestamp," and asked if there was any way to parameterize a desired order. He also didn't like how next isn't really self-documenting here, which I can sympathize with -- if you're not a bit careful, you can fall into the trap of forgetting the yield order.

Any suggestions?

QuarkJets
Sep 8, 2008

The ordering in this example is somewhat arbitrary, so you could just make the time loop the innermost loop.

If you wanted to give anyone the freedom of either style then you could split up the generator. There's no reason that this generator needs to accept "inputs" as an input argument; it can accept a single input instead. But don't name a variable "input", as that's a built-in.
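
A sketch of that split, reusing the names from the post above (list() here just materializes one time step so callers can index into the sources however they like):

Python code:
def source_outputs(single_input, agents):
    # one time step: every source of every agent acts on one input
    for agent in agents:
        for source in agent.sources:
            yield source.do_thing(single_input)

def thing_generator(inputs, agents):
    # callers that want a different order can consume source_outputs
    # directly instead
    for time, value in inputs:
        yield time, list(source_outputs(value, agents))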


C2C - 2.0
May 14, 2006

Anyone else using PyCharm on a MacBook? I'm doing the 100 Days of Python off of Udemy and am working through the course in PyCharm. I've got a directory set up that has individual folders for each day. Typically each morning I go into Finder and just create a new folder (say, "Day 35"). Then I open PyCharm, select "New Project", and navigate to that day's folder. Inevitably, when I click "Okay" on the pop-up dialog where you choose the interpreter and a few other things, it returns another pop-up telling me that the folder isn't empty, and I just select "Create Project From Source". Then the project loads and the empty Icon file opens in the editor. Is there any way to ignore this file or otherwise keep it from opening immediately upon each new project creation?
