Sad Panda
Sep 22, 2004

I'm a Sad Panda.

samcarsten posted:

Ok, got that working, new error when I try to click the button.

Message: stale element reference: element is not attached to the page document

Tried google, didn't help.

Can be caused by a variety of factors, one being if the page has updated itself.

When I was using Selenium I often ran it in PyCharm's interactive debug mode, which helps. It can be very fiddly, especially deciding on the right selector to get the element you need on poorly designed sites.
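A common workaround (a sketch; the URL and the "#submit" selector are hypothetical) is to re-locate the element inside an explicit wait instead of holding a reference across page updates:

Python code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/form")

# Re-find the element each time rather than reusing a stale handle;
# element_to_be_clickable also waits out in-flight DOM updates (up to 10 s).
wait = WebDriverWait(driver, 10)
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#submit")))
button.click()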


samcarsten
Sep 13, 2022


Sad Panda posted:

Can be caused by a variety of factors, one being if the page has updated itself.

When I was using Selenium I often ran it in PyCharm's interactive debug mode, which helps. It can be very fiddly, especially deciding on the right selector to get the element you need on poorly designed sites.

I tried using the wait command after inputting the text, but that didn't work. Any other ideas?

QuarkJets
Sep 8, 2008

12 rats tied together posted:

Ruby guru + woman sounds like Sandi Metz, who I have posted about ITT, but she is pro-OOP in the form of Alan Kay's "OOP is mostly message passing", and against the improper use of inheritance.

The video I always link is from a talk she does called "nothing is something", and it is very good. The talk is basically an example of inheritance gone wrong, how to recognize it, and how to fix it.

https://www.youtube.com/watch?v=OMPfEXIlTVE

Apologies if this is not what you were thinking of. :)

That's the one! Thank you, this is a great talk

QuarkJets
Sep 8, 2008

Josh Lyman posted:

The columns always have the same names because they're converted from a numpy array in a previous statement with mydf_in = pd.DataFrame(myNumpyArray). When I print mydf_in, the columns are just "named" 0 1 2.

Doesn't the randomness of the KeyError preclude a typo?

Why is list_column an argument of the function? According to this explanation, isn't it always "2"?

What specific key is mentioned in the KeyError exception? If it's something you don't expect (it sounds like this should only ever be "2"), then there's a typo in your code. If it's something you do expect, then you encountered an input array with an unexpected shape and need to write code to handle that.
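For illustration, a sketch of the kind of shape guard this implies (the function name is hypothetical; mydf_in, list_column, and the three-column expectation come from the posts):

Python code:
import pandas as pd

def extract_column(mydf_in: pd.DataFrame, list_column: int = 2) -> pd.Series:
    # Columns from pd.DataFrame(numpy_array) are named 0, 1, 2, ...
    # Fail loudly on unexpected shapes instead of hitting a bare KeyError later.
    if mydf_in.shape[1] <= list_column:
        raise ValueError(f"expected more than {list_column} columns, got {mydf_in.shape[1]}")
    return mydf_in[list_column]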

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
i wonder if someone went thru the history of this thread and counted up the number of python issues that wouldn't have existed in a language with static typing, what % of the posts would that be

QuarkJets
Sep 8, 2008

DELETE CASCADE posted:

i wonder if someone went thru the history of this thread and counted up the number of python issues that wouldn't have existed in a language with static typing, what % of the posts would that be

Index and key errors are the most common, I'm pretty sure; static typing wouldn't help with those. For instance, if Josh Lyman was using C they'd probably be asking for guidance on how to debug an infrequently occurring segmentation fault. Although there have been some newer programmers itt who have received advice on how to avoid typing issues before they can become a problem (e.g. try not to change a label's type)

A lot of people just want style tips or are asking for library recommendations

Josh Lyman
May 24, 2009


QuarkJets posted:

Why is list_column an argument of the function? According to this explanation, isn't it always "2"?

What specific key is mentioned in the KeyError exception? If it's something you don't expect (it sounds like this should only ever be "2"), then there's a typo in your code. If it's something you do expect, then you encountered an input array with an unexpected shape and need to write code to handle that.
It says "KeyError: 2". I'll remove that as an argument and see if that helps. :shrug:

Of the last 20,000 or so mydf_in's, the KeyError occurred about 25 times.

edit: Modified the code to remove the function call and put the statements directly in the script and cleaned it up a bit, should help with debugging if the KeyErrors still happen.

Josh Lyman fucked around with this message at 07:54 on Nov 15, 2022

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Josh Lyman posted:

It says "KeyError: 2". I'll remove that as an argument and see if that helps. :shrug:

Of the last 20,000 or so mydf_in's, the KeyError occurred about 25 times.

edit: Modified the code to remove the function call and put the statements directly in the script and cleaned it up a bit, should help with debugging if the KeyErrors still happen.

Reducing encapsulation is rarely going to help you and I really recommend you don’t do that. I’d say about 40% of the weird python errors I’ve had to help coworkers with were caused by a name collision in a huge monolithic script, where breaking it up into smaller functions would have either prevented it entirely or made the error much more obvious.

It’s a bit inelegant, but have you tried catching the KeyError and setting a breakpoint inside the catch statement? That should let you inspect the variables at the point of the error and get a better idea of what’s going on.
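A minimal sketch of that pattern (transform() is a hypothetical stand-in for the failing code; mydf_in is from the posts above):

Python code:
try:
    result = transform(mydf_in)  # the code that intermittently raises
except KeyError:
    breakpoint()  # drops into pdb here, with mydf_in inspectable in the frame
    raise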

Also, what’s your source for this data? Your comment about how three notebooks simultaneously hit an error kind of makes me suspect that the issue might originate with your data source (e.g. a database connection dropped, or a JWT token expired) but is not getting caught until later.

QuarkJets
Sep 8, 2008

Josh Lyman posted:

It says "KeyError: 2". I'll remove that as an argument and see if that helps. :shrug:

Of the last 20,000 or so mydf_in's, the KeyError occurred about 25 times.

edit: Modified the code to remove the function call and put the statements directly in the script and cleaned it up a bit, should help with debugging if the KeyErrors still happen.

Given that KeyError, we can conclude that the function received a dataframe that didn't have 3 columns of data. I think it's likely that you received an array with an unexpected shape, and that's probably what you'll see if you inspect one of these objects the next time that you encounter the exception

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
If you can, run it in a debugger (I know PyCharm and VSCode both have good debugging tools, but you can also figure out how to use good old-fashioned pdb). This gives you a much better view of wtf is going on, and it's an invaluable skill to build.

The number of people I know who are like 'oh I'll just add a ton of print statements instead of one breakpoint' drives me a little crazy sometimes.
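For reference, the old-fashioned route (script name and line number hypothetical):

code:
$ python -m pdb myscript.py
(Pdb) b myscript.py:42   # break at line 42
(Pdb) c                  # continue until the breakpoint
(Pdb) p some_variable    # inspect state
Or just drop breakpoint() into the code (Python 3.7+) and run it normally.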

Josh Lyman
May 24, 2009


QuarkJets posted:

Given that KeyError, we can conclude that the function received a dataframe that didn't have 3 columns of data. I think it's likely that you received an array with an unexpected shape, and that's probably what you'll see if you inspect one of these objects the next time that you encounter the exception
I was finally able to catch one of these occurrences as it was happening. I assumed the input dataframe would be good because I had existing code that does error checking for that. However, I completely overlooked that I had placed the new code (that throws the KeyError) above where the existing code has a chance to do error checking, so I've now moved it further down the script.

Hopefully that fixes the issue! :cheers:

edit: It fixed the issue, thanks everyone for your help!

Josh Lyman fucked around with this message at 21:45 on Nov 18, 2022

Seventh Arrow
Jan 26, 2005

I've been having a rough time figuring out how to get this to work - unfortunately since this is a Django thing with a bunch of interrelated files, this will have a lot of code snippets, sorry.

Basically I have a function that dumps a bunch of data into a csv. There's a text field where the client types in their name and this function pulls the "name" tag from the html form and makes that part of the filename. So far so good. But I also want to make the file type part of the filename as well, and it's behaving badly.

Here's the HTML that's being pulled (full contents at pastebin):

code:
 <div class="p-1 col-12 fw-bold mb-2">
                    <label class="text-r mb-1">Select File Type:</label>
                    <select name="type" class="form-select" aria-label="Default select example">
                        <option value="accounts">Accounts</option>
                        <option value="contacts">Contacts</option>
                        <option value="membership">Membership</option>
                        <option value="cg">Community Group</option>
                        <option value="cgm">Community Group Member</option>
                        <option value="so">Sales Order</option>
                        <option value="item">Item</option>
                        <option value="event">Event</option>
                        <option value="tt">Ticket Type</option>
                        <option value="si">Schedule Item</option>
                        <option value="attendee">Attendee</option>
                    </select>
                </div>
                <div class="p-1 col-12 fw-bold mb-2">
                    <label class="text-r mb-1">Organization Name:</label>
                    <input class="form-control" placeholder="Organization Name" type="text" name="name" required />
                </div>
Here's the function that exports to csv:

code:
import io

import xlsxwriter
from django.http import HttpResponse
from django.views import View

class CsvView(View):
    def post(self, request, *args, **kwargs):
        output = io.BytesIO()
        workbook = xlsxwriter.Workbook(output)
        worksheet = workbook.add_worksheet()
        data = request.POST["raees"]
        name = request.POST["name"]
        d_type = request.POST["type"]
        data = list(data.split(","))
        last = data[-1]
        first = data[0]
        data[0] = first.replace("[", "")
        data[-1] = last.replace("]", "")
        row = 0
        col = 0
        for i in data:
            i = i.replace("'", "")
            worksheet.write(row, col, i)
            row = row + 1
        workbook.close()
        output.seek(0)
        filename = f"{name} {d_type} Issue Tracker.xlsx"
        response = HttpResponse(
            output,
            content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        )
        response["Content-Disposition"] = "attachment; filename=%s" % filename
        return response
So for filename = f"{name} {d_type} Issue Tracker.xlsx", the {name} part works ok, but {d_type} does not. There's a previous function in the same script that uses "type", as you can see here - this works ok.

I tried d_type = request.POST.get("type") as well as d_type = request.GET["type"] to no avail. Sometimes I get a MultiValueDictKeyError when trying to retrieve "type", but usually it just doesn't put the type into the filename. I am admittedly taking a shotgun approach here.

Is there maybe a simple way to test the results of request.GET["type"]? Thanks for any help.

If you want to take a look at the application, it's here: https://csvcheck971.pythonanywhere.com/

Jose Cuervo
Aug 25, 2004
Finally getting a chance to keep working on this database. This is the code I have to set up and populate the database:
Python code:
import sqlite3
import pandas as pd

conn = sqlite3.connect('study_data.db')
c = conn.cursor()
c.execute("""CREATE TABLE blood_glucose (
            study_name text NOT NULL,
            SID text NOT NULL,
            date_time text NOT NULL,
            bg_mg_per_dL integer NOT NULL
            )""")

def insert_dataframe_data(study_name, df, c):
    for _, row in df.iterrows():
        c.execute(
            "INSERT INTO blood_glucose VALUES (:study_name, :SID, :date_time, :bg_mg_per_dL)", 
            {
                'study_name': study_name,
                'SID': row['SID'], 
                'date_time': row['Date_Time'], 
                'bg_mg_per_dL': row['Value']
            }
        )

for study_name in ['dclp3', 'wisdm', 'dclp5', 'sence']:
    df = pd.read_csv(f'/data/{study_name}_cgm_plus_features.csv')
    
    print(study_name)
    insert_dataframe_data(study_name, df, c)
    conn.commit()

conn.close()
This code generates a database with a single table with just over 25 million rows.

I then have the following code to obtain the time series subsequences I want for the list of (SID, date_time) pairs I need data for (on average there are between 10 and 100 pairs):
Python code:
ts_data = []
for SID, date_time in SID_date_time_pairs:
    ts_data += conn.execute("""
                            SELECT SID, date_time, bg_mg_per_dL FROM blood_glucose
                            WHERE SID == :SID
                            AND datetime(date_time) >= datetime(:date_time)
                            AND datetime(date_time) <= datetime(:date_time, '+24 hours')
                            """,
                            {'SID': SID,
                             'date_time': str(date_time)}).fetchall()

return pd.DataFrame(ts_data, columns=['SID', 'Date_Time', 'Value'])
The problem is that searching for the time series subsequence for a single pair takes several seconds, whereas I thought it would be much faster (sub one second), so the overall for loop can take minutes to run if there are 100 pairs to iterate over. Am I doing something wrong?

fisting by many
Dec 25, 2009



Seventh Arrow posted:

I've been having a rough time figuring out how to get this to work - unfortunately since this is a Django thing with a bunch of interrelated files, this will have a lot of code snippets, sorry.

Basically I have a function that dumps a bunch of data into a csv. There's a text field where the client types in their name and this function pulls the "name" tag from the html form and makes that part of the filename. So far so good. But I also want to make the file type part of the filename as well, and it's behaving badly.

Here's the HTML that's being pulled (full contents at pastebin):

So for filename = f"{name} {d_type} Issue Tracker.xlsx", the {name} part works ok, but {d_type} does not. There's a previous function in the same script that uses "type", as you can see here - this works ok.

I tried d_type = request.POST.get("type") as well as d_type = request.GET["type"] to no avail. Sometimes I get a MultiValueDictKeyError when trying to retrieve "type", but usually it just doesn't put the type into the filename. I am admittedly taking a shotgun approach here.

Is there maybe a simple way to test the results of request.GET["type"]? Thanks for any help.

If you want to take a look at the application, it's here: https://csvcheck971.pythonanywhere.com/

I would not use "type" as a variable name anywhere, since it's the name of a built-in Python function

request.GET contains GET parameters (e.g. query strings), request.POST contains form data. If a form field is not required, you should get the value with request.POST.get('field'); this will return None if the field is not filled out rather than raising an error.

or, ideally, use a FormView (or at least a form) and avoid reinventing the wheel.
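A minimal sketch of the form route (field names mirror the HTML in the post; the choices list is truncated):

Python code:
from django import forms

TYPE_CHOICES = [
    ("accounts", "Accounts"),
    ("contacts", "Contacts"),
    # ... the rest of the <select> options ...
]

class ExportForm(forms.Form):
    name = forms.CharField()
    # the attribute has to match the HTML name= attribute to bind the POST data
    type = forms.ChoiceField(choices=TYPE_CHOICES)

# in CsvView.post():
form = ExportForm(request.POST)
if form.is_valid():
    d_type = form.cleaned_data["type"]  # validated; no MultiValueDictKeyError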

UraniumAnchor
May 21, 2006

Not a walrus.

Jose Cuervo posted:

The problem is that searching for the time series subsequence for a single pair takes several seconds, whereas I thought it would be much faster (sub one second), so the overall for loop can take minutes to run if there are 100 pairs to iterate over. Am I doing something wrong?

My first off-the-hip guess is that you want to add an index to whatever column(s) you're searching on; otherwise it's doing a linear scan. Try checking https://www.sqlite.org/lang_createindex.html and see if that improves your times.
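A sketch of what that looks like here (index name hypothetical; note the query wraps date_time in datetime(), which a plain index on that column won't cover, but the SID equality alone cuts the scan down dramatically):

Python code:
import sqlite3

conn = sqlite3.connect('study_data.db')
# Without an index, every lookup scans all 25 million rows.
conn.execute("CREATE INDEX IF NOT EXISTS idx_bg_sid ON blood_glucose (SID)")
conn.commit()
conn.close()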

Jose Cuervo
Aug 25, 2004

UraniumAnchor posted:

My first off-the-hip guess is that you want to add an index to whatever column(s) you're searching on; otherwise it's doing a linear scan. Try checking https://www.sqlite.org/lang_createindex.html and see if that improves your times.

That did the trick, thank you!

Seventh Arrow
Jan 26, 2005

fisting by many posted:

I would not use "type" as a variable name anywhere, since it's the name of a built-in Python function

I kind of figured as much, which is why I started using "d_type" as a variable.

quote:

request.GET contains GET parameters (e.g. query strings), request.POST contains form data. If a form field is not required, you should get the value with request.POST.get('field'); this will return None if the field is not filled out rather than raising an error.

or, ideally, use a FormView (or at least a form) and avoid reinventing the wheel.

Thanks, I'm hoping something like this will set me on the right track https://www.geeksforgeeks.org/formview-class-based-views-django/

QuarkJets
Sep 8, 2008

I looked up whether sqlite has a built-in datetime or timestamp type, but I guess it doesn't. That's pretty lame.

timbro
Jul 18, 2004

Seventh Arrow posted:

I remember seeing a python learning site once that had a page that was something like "Why We Don't Recommend Zed Shaw/Learn Python The Hard Way", or something like that. I thought it was Reddit, but apparently not. Their points, if true, seemed to be pretty good reasons not to use his book(s).

I actually started with his book as my intro to programming and python since it seemed to be recommended everywhere. I've since used other stuff too.

It's actually really good for the beginner, as it starts you off doing stuff in the command line and text files so you get a basic appreciation of what Python is and does. The repetitive typing seems to pay off too, as you get familiar with how things should flow.

I find it odd that it worked for me, as normally I like the big concepts explained first so I know why I'm trying to do something, but this was more like: do this, now figure out how it works.

If someone I knew wanted to learn, I'd probably still point them to it or something like Codecademy. But I'm still really a beginner myself, so wth do I know!

Dawncloack
Nov 26, 2007
ECKS DEE!
Nap Ghost
If anyone here prefers a book over reading in a browser, I learnt Python from Introducing Python by Bill Lubanovic and found it good.

And I haven't used the Python book from this series, but I'm reading the SQL one and they approach teaching you the material in a novel manner that I'm sure some of you will appreciate.

I have a regex problem.

code:
import re

import_regexer = re.compile('^.*import .+$|^from .+ import .+$|^.*include .+$')
full_text = ''
with open('/correct/file/string/script.py', 'r') as readfile:
    full_text = readfile.read()

print(full_text)

match_obj = import_regexer.findall(full_text)
print(match_obj)

# This regex works fine on filenames, in the same script:
# extension_regexer = re.compile('[.]sh|[.]php|[.]py|^[^.]+$')

I am opening a 1 KB file that I know has the strings that I want. A regex tester website tells me my regex checks out.
And yet "match_obj = import_regexer.findall(full_text)" returns an empty list every time instead of any matches. I've tried raw strings. I've read the documentation. I am at a loss.

Thanks in advance.

nullfunction
Jan 24, 2005

Nap Ghost

Dawncloack posted:

I have a regex problem.

You're missing the re.MULTILINE flag:

Python code:
import_regexer = re.compile('^.*import .+$|^from .+ import .+$|^.*include .+$', flags=re.MULTILINE) 
That's what tells the regex module that you are feeding it multiple lines of text inside a single string. Otherwise, the caret would only match the start of the entire file, and the dollar sign the end of the file.

Obligatory "now you have two problems" and all that aside, I would rethink the way you've chosen to write your regex. Look into \s at the very least if you want to capture whitespace, using a .* or .+ when you actually want whitespace is guaranteed to wreck your day at some point in the future.

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.
Can anyone help me understand why this example ends in an error:

code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test = test.fillna('Missing')
test['a'] = np.where((test['a'] == 'Missing'), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test
but this works as expected?

code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test['a'] = np.where((test['a'].isna()), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test
I would think that the first where statement would only pass the population that doesn't satisfy its condition on to the second where statement, so the first snippet shouldn't be converting any strings to float if the second one isn't, but it doesn't seem to work that way.

nullfunction
Jan 24, 2005

Nap Ghost

nullfunction posted:

now you have two problems

As a general rule, if you want to interrogate Python scripts in any meaningful way, you should look into the ast module. It's part of the standard library, though I'd wager that someone's done the hard work and written parsers for PHP and bash as well, so with a quick pip install you could take a similar approach for the other file types you need to parse.

Python code:
import ast
import pathlib

contents = pathlib.Path("sample.py").read_text()
abstract_syntax_tree = ast.parse(contents)

def analyze_import(stmt: ast.Import):
    modules = [n.name for n in stmt.names]
    print(f"Found an import statement on line {stmt.lineno}: imported module{'s' if len(modules) > 1 else ''}: {', '.join(modules)}")

def analyze_import_from(stmt: ast.ImportFrom):
    aliases = [n.name for n in stmt.names]
    print(f"Found an import statement on line {stmt.lineno}: imported alias{'es' if len(stmt.names) > 1 else ''} {', '.join(aliases)} from module: {stmt.module}")

for item in abstract_syntax_tree.body:
    if isinstance(item, ast.Import):
        analyze_import(item)
    elif isinstance(item, ast.ImportFrom):
        analyze_import_from(item)
Python code:
# sample.py
import json
import os, sys
from datetime import timedelta, datetime
...
code:
Found an import statement on line 2: imported module: json
Found an import statement on line 3: imported modules: os, sys
Found an import statement on line 4: imported aliases timedelta, datetime from module: datetime

Dawncloack
Nov 26, 2007
ECKS DEE!
Nap Ghost
Those are awesome answers; I am certainly going to learn about \s and ast.

Thank you!

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Deadite posted:

Can anyone help me understand why this example ends in an error:

code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test = test.fillna('Missing')
test['a'] = np.where((test['a'] == 'Missing'), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test
but this works as expected?

code:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[np.nan,1,2,3,4,5,6,np.nan],'b':[7,np.nan,8,9,10,np.nan,11,12]})
test = df[['a']].reset_index(drop=True)
test['a'] = np.where((test['a'].isna()), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
test
I would think that the first where statement would only pass the population that doesn't satisfy its condition on to the second where statement, so the first snippet shouldn't be converting any strings to float if the second one isn't, but it doesn't seem to work that way.

In the first example you’re trying to call astype(float) on the string ‘Missing’

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.
I can tell that from the error; what I don't understand is why that string would have the method applied in the first place, since it should have been filtered out in the first part of the where function.

QuarkJets
Sep 8, 2008

Deadite posted:

I can tell that from the error; what I don't understand is why that string would have the method applied in the first place, since it should have been filtered out in the first part of the where function.

That's not the order of operations; the way you have this set up, astype occurs first, and that new dataset would become one of the inputs of the "where" function

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.

QuarkJets posted:

That's not the order of operations; the way you have this set up, astype occurs first, and that new dataset would become one of the inputs of the "where" function

So astype is executed before the first "where" even happens? I thought the first "where" statement would have filtered out the 'Missing' strings before the second where executes and applies the astype calls.

Jabor
Jul 16, 2010

#1 Loser at SpaceChem

Deadite posted:

So astype is executed before the first "where" even happens? I thought the first "where" statement would have filtered out the 'Missing' strings before the second where executes and applies the astype calls.

The order of operations might become clearer if you were to break out every subexpression and assign it to a variable, instead of packing it all into one line.

QuarkJets
Sep 8, 2008

Deadite posted:

So astype is executed before the first "where" even happens? I thought the first "where" statement would have filtered out the 'Missing' strings before the second where executes and applies the astype calls.

Let's go look at what `where` does
https://numpy.org/doc/stable/reference/generated/numpy.where.html

quote:

numpy.where(condition, [x, y, ]/)
Return elements chosen from x or y depending on condition.

Put another way, `where` creates a new array with elements selected from x or y, depending on some conditional. It does not actually transform any data; conditional, x, and y are not being modified. So you're really passing in 3 input arrays and getting 1 output array.

So what are conditional, x, and y in your case? Here's the line you wrote:
Python code:
test['a'] = np.where((test['a'] == 'Missing'), test['a'], np.where((test['a'].astype('float') < 3) | (test['a'].astype('float') > 5), 1000, test['a']))
This is a very long line of code, so I'm going to change it a little bit. This doesn't do the same thing, but it experiences the same problem for the same reason:
Python code:
test['a'] = np.where((test['a'] == 'Missing'), test['a'], test['a'].astype('float'))
Let's assign the function arguments to variables, so we can more easily see how the input arrays are being formed:
Python code:
conditional = (test['a'] == 'Missing')
x = test['a']
y = test['a'].astype('float')
test['a'] = np.where(conditional, x, y)
This illustrates the actual order of operations; the input arguments must be formed before the function can execute. Now it should be easy to see what the problem is: y is created by applying `astype('float')` to every element of test['a']. That's not a valid operation for strings, so any elements that are 'Missing' will cause an exception to be raised.

Since y is one of the inputs to `where`, it actually needs to be formed first before `where` can really execute. It doesn't matter that you were doing this in-line, you were still creating that temporary array.

So what can you do about this? You can use boolean arrays to actually return a filtered array, operate on it, then write it back.

Python code:
a_not_missing = test['a'] != 'Missing'
valid_elements = test['a'][a_not_missing]
valid_elements = np.where((valid_elements.astype('float') < 3) | (valid_elements.astype('float') > 5), 1000, valid_elements)
test['a'][a_not_missing] = valid_elements
Or, done entirely with boolean arrays, and using `nan` checking directly instead of that goofy 'Missing' placeholder:
Python code:
a_not_nan = ~np.isnan(test['a'])
valid_elements = test['a'][a_not_nan]
valid_elements[valid_elements.astype('float') < 3] = 1000
valid_elements[valid_elements.astype('float') > 5] = 1000
test['a'][a_not_nan] = valid_elements

QuarkJets fucked around with this message at 05:37 on Nov 24, 2022

Seventh Arrow
Jan 26, 2005

The good news is that my manager liked that csv analyzer project that I've been talking about.

The bad news is that I'm now the office "python guy" and he dumped a huge ETL project in my lap, full of all kinds of things I don't know how to do, including:
  • pulling metadata dynamically from the Salesforce API and putting it into MySQL tables
  • pulling all the fields that need to be validated for a specific dataset
  • making calls to various stored procedures in the database
  • shoving all the actual data into the Salesforce API

Of course, I have a trillion questions, but for the sake of sanity I'm going to limit myself to a few:

i) He wants to pull (meta)data from a REST API and then put it into a table with columns and rows. This doesn't make sense to me - as far as I know, data is retrieved from APIs in JSON format and would more properly fit in a dictionary rather than columns and rows. Am I missing something, or is there a way to do this? (the metadata that I saw has a format like "{character limit: 16, decimal places: 2}", etc.)
ii) Is there some sort of awesome tutorial that will aid me in my journey of learning how to make python communicate with relational databases like MySQL?
iii) As far as I know, none of the metadata is actually tagged in the system as "metadata." Is there any strategy for simplifying this, or am I really just going to have to pull each and every single item by name and assign it to a variable?

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.
Oooooooh okay, I see now. Thank you for your very clear explanation. I didn't think about how all of the arguments in the "where" function needed to be resolved before the function executes, but that is how literally all functions work. I just couldn't see it in this case for whatever reason.

For context I was trying to write a "where" statement to replace values in a column that already contained the "Missing" strings, so I needed to find a way to filter those out, then apply the criteria to the intended population, and return all the values back to the dataframe with every value in its original index position.

QuarkJets
Sep 8, 2008

Seventh Arrow posted:

The good news is that my manager liked that csv analyzer project that I've been talking about.

The bad news is that I'm now the office "python guy" and he dumped a huge ETL project in my lap, full of all kinds of things I don't know how to do, including:
  • pulling metadata dynamically from the Salesforce API and putting it into MySQL tables
  • pulling all the fields that need to be validated for a specific dataset
  • making calls to various stored procedures in the database
  • shoving all the actual data into the Salesforce API

Of course, I have a trillion questions, but for the sake of sanity I'm going to limit myself to a few:

i) He wants to pull (meta)data from a REST API and then put it into a table with columns and rows. This doesn't make sense to me - as far as I know, data is retrieved from APIs in JSON format and would more properly fit in a dictionary rather than columns and rows. Am I missing something, or is there a way to do this? (the metadata that I saw has a format like "{character limit: 16, decimal places: 2}", etc.)
ii) Is there some sort of awesome tutorial that will aid me in my journey of learning how to make python communicate with relational databases like MySQL?
iii) As far as I know, none of the metadata is actually tagged in the system as "metadata." Is there any strategy for simplifying this, or am I really just going to have to pull each and every single item by name and assign it to a variable?

i) You could think of each field in the JSON as a column, and one row is whatever you get from 1 call to the API. Or you could be inserting several rows from 1 API call. It really depends on what the API is actually returning and how that relates to the tables in the database. It sounds like this database already exists, and your boss is asking you to automate a data entry task that people have to do by hand - if that's the case, then go talk to the people currently doing that and take notes.
ii) At the end of the day you probably want something like https://github.com/PyMySQL/PyMySQL, it's pretty easy to use. There's also SQLAlchemy, which I don't have experience using; I just know that it's got a more Python-y interface, whereas pymysql is more about executing whatever SQL you specify.
iii) First you should develop a model of the data you're receiving from the salesforce API and a model of the data you're handling with the database and its procedures. That's the hard but important part. Then you can build up your data structures around those details. Then you can write the functions that transform the data from one system to the other. Use a dictionary to store field name/value pairs if you want to do something basic, but a dataclass is the right approach for data structures with rigid schemas, such as MySQL tables and specific keys returned by API calls.
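A minimal sketch of the dataclass idea (the field names are hypothetical stand-ins for whatever the Salesforce metadata actually contains):

Python code:
from dataclasses import dataclass

@dataclass
class FieldMetadata:
    field_name: str
    character_limit: int
    decimal_places: int

meta = FieldMetadata(field_name='AccountNumber', character_limit=16, decimal_places=2)
# unlike a raw dict, a missing or misspelled field fails at construction time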

Seventh Arrow
Jan 26, 2005

QuarkJets posted:

i) You could think of each field in the JSON as a column, and one row is whatever you get from 1 call to the API. Or you could be inserting several rows from 1 API call. It really depends on what the API is actually returning and how that relates to the tables in the database. It sounds like this database already exists, and your boss is asking you to automate a data entry task that people have to do by hand - if that's the case, then go talk to the people currently doing that and take notes.
ii) At the end of the day you probably want something like https://github.com/PyMySQL/PyMySQL, it's pretty easy to use. There's also SQLAlchemy, which I don't have experience using; I just know that it's got a more Python-y interface, whereas pymysql is more about executing whatever SQL you specify.
iii) First you should develop a model of the data you're receiving from the salesforce API and a model of the data you're handling with the database and its procedures. That's the hard but important part. Then you can build up your data structures around those details. Then you can write the functions that transform the data from one system to the other. Use a dictionary to store field name/value pairs if you want to do something basic, but a dataclass is the right approach for data structures with rigid schemas, such as MySQL tables and specific keys returned by API calls.

Good advice, I will mull this over, thank you!

duck monster
Dec 15, 2004

ExcessBLarg! posted:

That's a name I haven't seen in a long time.

He's not entirely wrong. Once you reach the size of Twitter (circa 2011) it might make sense to rewrite your Ruby backend in something more performant. But that's not to say that Ruby was the wrong choice for Twitter in 2006. IIRC Shaw wrote the Ruby-based web server that Twitter originally used so I don't know if he's salty about this.

There's a popular misconception that tech startups either grow to astronomical heights or crash and burn spectacularly. The reality is that there's lots of tech companies that operate in niche markets and have been running as effective small businesses for a decade or two now. As you say, hosting costs could be reduced but that's not usually your greatest efficiency gain and may even have negative effects elsewhere.

I'd say the vast majority either never make it to market, or remain small and profitable enough that they survive but don't necessarily go big.

What I've seen over 20+ years in the industry is that time and time again startups get locked into overly long development cycles and burn out all their cash without ever getting to launch, because they get stuck trying to engineer it for a billion customers when they haven't got around to getting one customer yet. I'm always advising "Get the minimum viable product out, and if we need to rapidly scale, then that's a good problem to have and we can hire some guys. Don't go overboard on features yet, just make it the basic product you promised the investors and we can add the juicy stuff AFTER launch", and don't get distracted by shitful buzz ideas like blockchains or whatever. And conceptually, if you can't explain the thing in the time it takes an elevator to get to the lunch bar, the whole idea desperately needs a rethink.

And ultimately scalable isn't THAT much harder: stick to the Heroku twelve-factor principles, stick pg_bouncer in front of your database, and MAYBE add a NoSQL store if a clear case can be made for it. Just jam the thing in a Kubernetes autoscale group and blammo, you have something that'll scale up to the first million customers (which, let's face it, will be a miracle if it ever gets that big, but it's a nice thing to tell investors).

Uh. ADHD, what am I talking about. Oh right, yeah, my point is SUPPOSED to be that overly long dev times kill companies dead, because the REAL cost of the operation is always gonna be actually developing the thing (and, if you make it to the end, marketing the thing).

canyoneer
Sep 13, 2005


I only have canyoneyes for you

duck monster posted:

I'd say the vast majority either never make it to market, or remain small and profitable enough that they survive but don't necessarily go big.

Most businesses throughout human history have been impatient for profitability but patient for growth. Modern financial systems have made the inverse possible, with sometimes very weird results!

I am new to programming and Python and having some fun with it. I'm connecting to a Cloudera Impala ODBC source through Python. I'm using the pyodbc library and it's working fine, but it throws an angry warning that I should be using SQLAlchemy instead. Should I?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

iii) First you should develop a model of the data you're receiving from the salesforce API and a model of the data you're handling with the database and its procedures. That's the hard but important part. Then you can build up your data structures around those details. Then you can write the functions that transform the data from one system to the other. Use a dictionary to store field name/value pairs if you want to do something basic, but a dataclass is the right approach for data structures with rigid schemas, such as MySQL tables and specific keys returned by API calls.

+1 to this, because it's a very easy step to skip past (and if you're used to writing one-off scripts, it's something you generally do skip in those cases). But sitting down and really understanding the data you're handling is going to make it a ton easier to iterate moving forward, instead of getting into dictionary hell where you find yourself passing around a gigantic dictionary of values you know nothing about.

FWIW: If the API you're using returns a massive pile of data for some reason, it's also okay to go 'alright, it's going to give me back a dictionary with fifteen sub-entries, I'm going to discard fourteen of those and only model the one I actually care about for this application.' For example, I was working with a ticketing API: call it and you get back all the data in the ticket, but I only really cared about one specific portion. No reason for me to write up proper object handling for all the bits I'm never going to interact with in this program, so I just started with something like this:

Python code:
results = get_ticket_data_from_api(ticketnumber)

access_details: AccessDetails = build_access_details_obj(results['access_details'])
Where build_access_details_obj() is a factory function that builds me a nicely modeled Python object for dealing with the access details.
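A sketch of what such a factory can look like (the AccessDetails fields are hypothetical):

Python code:
from dataclasses import dataclass

@dataclass
class AccessDetails:
    requester: str
    resource: str

def build_access_details_obj(raw: dict) -> AccessDetails:
    # model only the keys this program cares about; silently drop the rest
    return AccessDetails(requester=raw['requester'], resource=raw['resource'])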

Falcon2001 fucked around with this message at 01:20 on Nov 26, 2022

QuarkJets
Sep 8, 2008

Yeah don't be afraid to drop data that you don't actually care about. For instance

Seventh Arrow posted:

(the metadata that I saw has a format like "{character limit: 16, decimal places: 2}", etc.)

Do you actually care about these fields? I'm guessing not; there's probably no need to insert these into your database. But you can't really know without first understanding what's in your database. Don't assume anything.

Seventh Arrow
Jan 26, 2005

I'm probably going to need all of the metadata, yes. The idea is to have a dynamic ruleset that gets applied whenever we need to validate the client's actual data. So when, say, the account number is being validated, there should be an up-to-date rule saying that it has a max of 16 characters or whatever. Fortunately the validation is going to be done by stored procedures, which I will not be in charge of writing. All that my initial script has to do is pull the metadata from Salesforce and update the database with it. Fortunately, there seems to be a module that makes interacting with Salesforce easier:

https://pypi.org/project/simple-salesforce/

If I get in over my head, I guess I can see about hiring help from Fiverr.
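For what it's worth, a hedged sketch of pulling field metadata with simple-salesforce (credentials and object name are hypothetical; check the describe() payload against your own org):

Python code:
from simple_salesforce import Salesforce

sf = Salesforce(username='user@example.com', password='...', security_token='...')
desc = sf.Account.describe()  # per-field metadata for the Account object
for field in desc['fields']:
    # e.g. name, length, precision, scale -- the character-limit style rules
    print(field['name'], field['length'], field['precision'], field['scale'])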


Ihmemies
Oct 6, 2012

Does anyone have an idea why the colors in VSCode don't work? Like, if I put some piece of code as #FF5050 in settings.json, it renders as #9f4557 on my screen. It is so much lower in brightness and more muted. I don't understand how inputting one color results in another color being displayed. Any tips?

I use bright colors like
code:
        "functions": "#FFF850",
        "keywords": "#50F8FF",
        "numbers": "#FF50FF",
        "strings": "#50F850",
        "types": "#50F8FF",
        "variables": "#FF50FF",
        "comments": "#50F850",
And they are all messed up. :sigh:

Like this. Qt left, VSC right: [screenshot comparing the two]

Edit: --disable-gpu launch param seems to have helped.

Ihmemies fucked around with this message at 11:11 on Nov 26, 2022
