NinpoEspiritoSanto
Oct 22, 2013




huhu posted:

Edit: I'm doing a bunch of dumb crap. Taking a break. Carry on.

Another rubber duck success story

susan b buffering
Nov 14, 2016

Zero Gravitas posted:

Is there a way to create pandas dataframes in a class object? (I think that's the term.)

I'm trying to create a stock control system with a bunch of dataframes for holding data. I thought I'd create a class so I can start a bunch of dataframes all with the same column headings and append data to them later.

Plainly I'm doing something wrong with my class creation, since I simply get an object that doesn't inherit the pd.DataFrame options.

Be gentle, this isn't my usual day job.
You need to assign the dataframe to an instance variable, like so:

code:
self.df = pd.DataFrame(columns=columnNames)
Then you can access the dataframe from within the instance using self.df, but you need to include `self` as the first parameter of your instance methods. Here's a version of your code that hopefully helps you get started:

code:

import pandas as pd

class Inventory():

    """
    Class for inventory operations

    - initialise
    - add item
    - remove item
    - add qty
    - remove qty

    """
    def __init__(self):

        columnNames = ["SKU", "BATCH", "ORDERNO", "DESCRIPTION", "VALUE", "QTY"]

        self.df = pd.DataFrame(columns=columnNames)

    def addItem(self, item):
        # bad example -- DataFrame.append returns a new frame rather than
        # modifying in place, so the result has to be assigned back
        self.df = self.df.append(item, ignore_index=True)

    def subtractItem(self):
        #TODO
        pass

    def addQty(self):
        #TODO
        pass

    def subtractQty(self):
        #TODO
        pass

The example code I put in addItem was mainly to show how you would access the dataframe. Most of my experience using pandas has been analyzing preexisting datasets, so I've no idea if "append" is the correct method to actually use.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

skull mask mcgee posted:

You need to assign the dataframe to an instance variable, like so:

code:
self.df = pd.DataFrame(columns=columnNames)
Then you can access the dataframe from within the instance using self.df, but you need to include `self` as the first parameter of your instance methods. Here's a version of your code that hopefully helps you get started:

code:

import pandas as pd

class Inventory():

    """
    Class for inventory operations

    - initialise
    - add item
    - remove item
    - add qty
    - remove qty

    """
    def __init__(self):

        columnNames = ["SKU", "BATCH", "ORDERNO", "DESCRIPTION", "VALUE", "QTY"]

        self.df = pd.DataFrame(columns=columnNames)

    def addItem(self, item):
        # bad example -- DataFrame.append returns a new frame rather than
        # modifying in place, so the result has to be assigned back
        self.df = self.df.append(item, ignore_index=True)

    def subtractItem(self):
        #TODO
        pass

    def addQty(self):
        #TODO
        pass

    def subtractQty(self):
        #TODO
        pass

The example code I put in addItem was mainly to show how you would access the dataframe. Most of my experience using pandas has been analyzing preexisting datasets, so I've no idea if "append" is the correct method to actually use.

I've never thought to use a df in a class before, good example.

If you're adding one df of N rows to the end of another (think adding a new row in a db), you use pd.concat([df1, df2], ignore_index=True/False)
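
e.g. the addItem above could become something like this (untested, assumes item is a dict of column values):

code:
def addItem(self, item):
    # inside the Inventory class from above; item is assumed to be a dict
    # like {"SKU": ..., "BATCH": ..., "QTY": ...}
    new_row = pd.DataFrame([item])
    self.df = pd.concat([self.df, new_row], ignore_index=True)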

susan b buffering
Nov 14, 2016

CarForumPoster posted:

I've never thought to use a df in a class before, good example.

If you're adding one df of N rows to the end of another (think adding a new row in a db), you use pd.concat([df1, df2], ignore_index=True/False)

Yeah, concat is right.

I hadn’t really thought of doing what the OP is doing either. The closest I’ve come is having a method return a dataframe, with the data coming from a sqlite database.
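
Roughly like this (table and column names made up):

code:
import sqlite3
import pandas as pd

def load_inventory(db_path):
    # return a dataframe built from a sqlite table (names are made up)
    conn = sqlite3.connect(db_path)
    try:
        return pd.read_sql_query("SELECT sku, batch, qty FROM inventory", conn)
    finally:
        conn.close()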

lazerwolf
Dec 22, 2009

Orange and Black
Why not use a dictionary? A dataframe seems like overkill here.

code:

class Inventory():
    def __init__(self):
        self.inventory = {}

    def addItem(self, data):
        # let's pretend data is some tuple
        sku, batch, orderno, desc, value, qty = data

        # Some error handling if SKU exists
        self.inventory[sku] = {
            "BATCH": batch,
            "ORDERNO": orderno,
            "DESCRIPTION": desc,
            "VALUE": value,
            "QTY": qty,
        }


    def subtractItem(self, sku):
        # Error handling if key doesn't exist
        del self.inventory[sku]

    def addQty(self, sku, qty):
        item = self.inventory.get(sku)
        # error handling if item doesn't exist
        item['QTY'] += qty
        self.inventory[sku] = item

    def subtractQty(self, sku, qty):
        item = self.inventory.get(sku)
        # error handling if item doesn't exist
        item['QTY'] -= qty
        # error handling if this reaches below 0
        self.inventory[sku] = item 

Use a dictionary of dictionaries and perform the proper updates and error handling as necessary. If you want to convert to a dataframe, you can do that from a dict of dicts with this one-liner:
code:
df = pd.DataFrame.from_dict(self.inventory, orient='index')
and write to CSV easily:
code:
df.to_csv('filepath', sep=',')
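
Exercising it would look something like this (made-up data):

code:
import pandas as pd

inv = Inventory()
inv.addItem(("SKU001", "B1", "ORD7", "Widget", 9.99, 5))
inv.addQty("SKU001", 3)

df = pd.DataFrame.from_dict(inv.inventory, orient='index')
df.to_csv('inventory.csv', sep=',')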

lazerwolf fucked around with this message at 17:56 on Mar 15, 2020

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Question: How should I one hot encode some categorical features where one DF element can contain multiple categories?

I have a comma-separated string of dept of state executive titles in each DF element. They can appear in any order.

For example:
code:
df = pd.DataFrame({"titles": [
    "CEO,COO",
    "MGR,MGR,MGR",
    "CEO,CTO,CFO,COO",
    "P,Chairman",
]})
I want to one hot encode this feature such that for the 7 unique titles I have 7 columns with a 1 or 0 for whether that title appears in that row. I don't know how many uniques there are, but there are a max of 7 per row.

e.g. desired output:
code:
index | Title List | CEO | COO | MGR | CTO | CFO | P | Chairman
0 | CEO,COO | 1 | 1 | 0 | 0 | 0 | 0 | 0
Normally I'd use pd.get_dummies() on a column, is there a best way to do this? My data set is roughly 300k rows.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

CarForumPoster posted:

Question: How should I one hot encode some categorical features where one DF element can contain multiple categories?
Normally I'd use pd.get_dummies() on a column, is there a best way to do this? My data set is roughly 300k rows.

Haven't tested it on the full data set yet, but here's a solution to share w/ the thread. I thought it was nifty. Pandas is just so loving powerful.

code:
# df looks like:
# index  titles
# 0      CEO,COO
# 1      MGR,MGR,MGR
# 2      CEO,CTO,CFO,COO
# 3      P,Chairman

# split strings into columns
df2 = pd.concat([df, df['titles'].str.split(",", expand=True)], axis="columns")

# gives me integer named columns in df2, get a list of all unique values in those
new_cols = pd.concat([df2[col] for col in df2.columns if isinstance(col, int)]).unique()
new_cols = new_cols.tolist()

# make those unique values columns, dropping any NaNs
# check whether those strings exist in the original column and one hot encode
# convert True/False to int
for col in new_cols:
    if col:
        print(col)
        df2[col] = df2['titles'].str.contains(col).astype(int)
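
For reference, pandas can also do the split-and-encode in one go with Series.str.get_dummies; I haven't compared it against the loop above, so treat it as a sketch:

code:
dummies = df['titles'].str.get_dummies(sep=',')
df2 = pd.concat([df, dummies], axis="columns")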

larper
Apr 9, 2019
2 hour quarantine project

https://gist.github.com/blindstitch/368940ed98993b78e07b75a66b2b6cb5

CarForumPoster
Jun 26, 2013

⚡POWER⚡
I want to visualize how calls get routed in my company by looking at a count or the % that go through each hop. I have 1 call per row with all the hops.

Something like:
code:
	Callee0	Callee1	Callee2	Callee3	Callee4	Callee5	Callee6	Callee7
10	CPXML	RingGroup (1003)	1001	NaN	NaN	NaN	NaN	NaN
178	CPXML	RingGroup (1003)	Voicemail (1003)	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
4616	CPXML	RingGroup (1003)	1004	NaN	NaN	NaN	NaN	NaN
4642	CPXML	1005	NaN	NaN	NaN	NaN	NaN	NaN
My desired output is something like a decision tree:


I know graphviz/pydot/fastdot can make decision tree graphs from SKlearn, but any advice on how to do it for this application?

EDIT:

Solved it. The solution was to do a groupby with .size() as the aggfunc, then plot with fastdot (a graphviz/pydot wrapper).

Solution:
code:

steps = df1.filter(regex=r'Callee\d$')
steps = steps.fillna("Ended")
sg = steps.groupby(steps.columns.tolist(), 
              as_index=False,
              squeeze=True
             ).size().reset_index()
sg = sg.sort_values(by=[0], ascending=True)

from fastdot import Dot, seq_cluster

g = Dot()
for index, row in sg[-10:].iterrows():
    g.add_item(seq_cluster(list(row[:-1].unique()), f"Total: {row[0]} Ratio: {row[0]/sg[0].sum()}"))

g.write_jpeg('top10calls.jpg')
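
If you'd rather skip the fastdot dependency, the same idea works with plain graphviz by counting hop-to-hop transitions. A rough, untested sketch:

code:
from collections import Counter
from graphviz import Digraph

# count transitions between consecutive hops across all calls
edges = Counter()
for _, row in steps.iterrows():
    hops = [h for h in row if h != "Ended"]
    for a, b in zip(hops, hops[1:]):
        edges[(a, b)] += 1

g = Digraph()
for (a, b), n in edges.items():
    g.edge(str(a), str(b), label=str(n))
g.render('call_routing', format='png')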

CarForumPoster fucked around with this message at 18:57 on Mar 23, 2020

susan b buffering
Nov 14, 2016

Think I'm fully onboard the attrs train now. The ability to define a converter function for attributes is extremely convenient when dealing with nested data structures.

mr_package
Jun 13, 2000
Can you show an example? I just use asdict(). It requires hashable (frozen=True) classes for anything that will nest as dict keys, though. But it has the advantage of just dumping the whole thing to a json-able (simplejson, that is) string. And then loading it back in and converting back with cattrs structure_attrs_from_dict(object, attrs_class_type) works perfectly. Unless you use forward annotations.

https://github.com/Tinche/cattrs/issues/80

But I also like that decorators work with attrs (they are broken in @dataclasses). I haven't needed getters/setters/deleters with attrs classes but I do use @property.
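
For what it's worth, my mental model of the converter pattern being described is roughly this (a made-up sketch, not necessarily what susan meant):

code:
import attr

@attr.s(auto_attribs=True, frozen=True)
class Point:
    x: float
    y: float

def to_point(value):
    # accept either an existing Point or a plain dict (e.g. parsed from json)
    if isinstance(value, Point):
        return value
    return Point(**value)

@attr.s(auto_attribs=True, frozen=True)
class Shape:
    name: str
    origin: Point = attr.ib(converter=to_point)

# the dict gets converted into a nested Point automatically
shape = Shape(name="box", origin={"x": 1.0, "y": 2.0})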

Macichne Leainig
Jul 26, 2012

by VG
What’s a good place to ask dumb machine learning questions?

I’ve got a training pipeline set up for image segmentation and it works quite well overall, but due to memory constraints on my GPU I had to dial my batch size down to 1.

I have a total set of 100 training images so I said screw it and pumped it up to 2, and that made my val_loss super jittery and the purported accuracy stopped around 0.8, whereas with a batch size of 1 it’s not jittery and the accuracy metric goes up to 0.93. The resulting mask from predictions is much less accurate too.

This is one of those things where I definitely know enough to just get into trouble, and I’m just curious why increasing the batch size to 2 would affect that.

QuarkJets
Sep 8, 2008

A larger batch size is more likely to converge into a sharp minimum, which can cause overall worse testing accuracy. Your training dataset is very small, what's your testing dataset size?

Macichne Leainig
Jul 26, 2012

by VG

QuarkJets posted:

A larger batch size is more likely to converge into a sharp minimum, which can cause overall worse testing accuracy. Your training dataset is very small, what's your testing dataset size?

It’s 100 total images with an 80/20 split for training/validation, but I’m using the Unity3D image synthesis plugin to generate training data so I can always generate more. I figured 100 was plenty to get an idea of whether or not I could do anything meaningful.

Dominoes
Sep 20, 2007

Showerthought project: Create a python interpreter that's a subset of the current one:
- Dramatically trimmed-down std lib
- Some struct-likes removed or streamlined. (ie consolidate namedtuple, NamedTuple, dataclass, TypedDict, class etc)
- (related to above) All classes have freebie dunder methods like __repr__, and constructors (make dataclass default, and don't require the @?)
- Remove some secondary functional APIs like for enum
- Won't run unless it passes the type checker like mypy (perhaps with a flag to disable/enable?)
- Doesn't accept third-party modules that don't use wheel format and/or don't specify dependencies/metadata properly

More about removing than adding. I haven't dug into the Python code yet. Dumb idea? Where would you start?

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Dominoes posted:

Showerthought project: Create a python interpreter that's a subset of the current one:
- Dramatically trimmed-down std lib
- Some struct-likes removed or streamlined. (ie consolidate namedtuple, NamedTuple, dataclass, TypedDict, class etc)
- (related to above) All classes have freebie dunder methods like __repr__, and constructors (make dataclass default, and don't require the @?)
- Remove some secondary functional APIs like for enum
- Won't run unless it passes the type checker like mypy (perhaps with a flag to disable/enable?)
- Doesn't accept third-party modules that don't use wheel format and/or don't specify dependencies/metadata properly

More about removing than adding. I haven't dug into the Python code yet. Dumb idea? Where would you start?

Why

huhu
Feb 24, 2006
https://micropython.org/ ?

Dominoes
Sep 20, 2007

Hah, I guess. Was thinking of any target. Just a wishlist for a cleaned-up Py. Have you tried MicroPython? What do you think?

Dominoes fucked around with this message at 01:41 on Apr 1, 2020

mr_package
Jun 13, 2000
I think this article nails it: https://glyph.twistedmatrix.com/2019/06/kernel-python.html

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!

Dominoes posted:

Showerthought project: Create a python interpreter that's a subset of the current one:
- Dramatically trimmed-down std lib
- Some struct-likes removed or streamlined. (ie consolidate namedtuple, NamedTuple, dataclass, TypedDict, class etc)
- (related to above) All classes have freebie dunder methods like __repr__, and constructors (make dataclass default, and don't require the @?)
- Remove some secondary functional APIs like for enum
- Won't run unless it passes the type checker like mypy (perhaps with a flag to disable/enable?)
- Doesn't accept third-party modules that don't use wheel format and/or don't specify dependencies/metadata properly

More about removing than adding. I haven't dug into the Python code yet. Dumb idea? Where would you start?

My personal experience trying to recreate much of the Python user experience in a custom interpreter is that there's a lot of moon magic hiding under the hood that comes out and destroys your life. Things you think you can eliminate--or even just compromise--for the sake of parsimony eventually explode to their true forms. I guess if you're trying to start from CPython and just snip, then that's one thing; it'll probably crash on some dangling bit though. I haven't even pondered most of the data types you listed, but I can contrast the super type versus a regular class. Super type is a sneaky little fucker. Internally it overrides __getattribute__, but you have to dig to figure out that's what it's doing. If you casually poke it with a stick in the REPL, you think you're seeing different results.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
It's called RPython and it powers PyPy, op

Macichne Leainig
Jul 26, 2012

by VG
Hey, I figured out my neural network problem.

Turns out if you preprocess and normalize the images going into the neural network, you also have to preprocess and normalize images going into the prediction workflow.

Who’d a thunk? Not me :saddowns:

That said, I’ve been reading a lot about neural networks and feel much more confident about my knowledge. I’m not going to be breaking any records or doing anything noteworthy but I at least know what overfit and underfit mean now.

pmchem
Jan 22, 2010


Depending on what you're doing, you can build normalization into the network, e.g.:
https://keras.io/layers/normalization/

that’s sometimes more convenient
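
A rough sketch of baking the input scaling into the model itself (layer names assume a recent tf.keras, so adjust for your version; shapes are made up):

code:
import tensorflow as tf

model = tf.keras.Sequential([
    # scale raw 0-255 pixel values inside the model, so training and
    # prediction inputs get identical preprocessing
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(256, 256, 3)),
    # ... rest of the segmentation network goes here ...
])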

hhhmmm
Jan 1, 2006
...?

Dominoes posted:

Showerthought project: Create a python interpreter that's a subset of the current one:
- Dramatically trimmed-down std lib
- Some struct-likes removed or streamlined. (ie consolidate namedtuple, NamedTuple, dataclass, TypedDict, class etc)
- (related to above) All classes have freebie dunder methods like __repr__, and constructors (make dataclass default, and don't require the @?)
- Remove some secondary functional APIs like for enum
- Won't run unless it passes the type checker like mypy (perhaps with a flag to disable/enable?)
- Doesn't accept third-party modules that don't use wheel format and/or don't specify dependencies/metadata properly

More about removing than adding. I haven't dug into the Python code yet. Dumb idea? Where would you start?

Consider the effort required vs the potential upside..? Why would you do that?

There is this weird pattern of programmers moving into Python with experience from large code bases in other languages. They start focusing on fixing 'problems' with Python to future-proof applications. That would matter if we were building a massive monolith of a code base that will be used for 25 years, but usually the requirement is just to create ~100 lines of code we can move into a docker container to solve some small problem.

hhhmmm
Jan 1, 2006
...?

Protocol7 posted:

Hey, I figured out my neural network problem.

Turns out if you preprocess and normalize the images going into the neural network, you also have to preprocess and normalize images going into the prediction workflow.

Who’d a thunk? Not me :saddowns:

That said, I’ve been reading a lot about neural networks and feel much more confident about my knowledge. I’m not going to be breaking any records or doing anything noteworthy but I at least know what overfit and underfit mean now.

You probably know this already, but be sure to use n-fold validation to estimate performance of your pipelines. Also, include ALL the choices made in creating that pipeline. It is tempting to omit part of the preprocessing in the validation step for faster results, but the preprocessing steps matter a lot, sometimes even more than model tuning. Don't just use n-fold validation for model performance; use it to estimate performance for the entire pipeline.

Nippashish
Nov 2, 2005

Let me see you dance!
Tracking stupid statistics about data flowing through your pipeline is surprisingly useful for catching mistakes like this. The mean pixel value entering the network is pretty opaque on its own, but is immediately suspicious when it's different between training and prediction.
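
Even something as dumb as this catches a lot (a sketch; wire it into whatever feeds your batches):

code:
def log_batch_stats(name, batch):
    # batch: numpy array of images; print the kind of "stupid statistic"
    # that makes a train/predict mismatch jump out immediately
    print(f"{name}: mean={batch.mean():.4f} std={batch.std():.4f} "
          f"min={batch.min():.4f} max={batch.max():.4f}")

# log_batch_stats("train", train_images)
# log_batch_stats("predict", prediction_images)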

Data Graham
Dec 28, 2009

📈📊🍪😋



Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"}

Dominoes
Sep 20, 2007

hhhmmm posted:

Consider the effort required vs the potential upside..? Why would you do that?

There is this weird pattern of programmers moving into Python with experience from large code bases in other languages. They start focusing on fixing 'problems' with Python to future-proof applications. That would matter if we were building a massive monolith of a code base that will be used for 25 years, but usually the requirement is just to create ~100 lines of code we can move into a docker container to solve some small problem.
Probably not worth it. I get the impression most people using Python fire up whatever Python comes with their version of Linux or whatever their company uses, and roll with that. We're sitting on a mess that's built up over time, as things naturally do. No one wants to tidy it up, weighing the downsides heavily without considering the upsides. The article mr_package posted is nice.

Given how long folks stayed on Py2, I'm not optimistic.

Dominoes fucked around with this message at 21:57 on Apr 2, 2020

breaks
May 12, 2001

Data Graham posted:

Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"}

It makes more sense if you remember that foo[bar] is basically just a nicer looking way to write foo.__getitem__(bar), regardless of foo’s type.

(And of course it's actually a little more complicated than that when you consider assignment and del, and the corresponding __setitem__ and __delitem__)
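
A toy example of what those hooks look like on a custom class:

code:
class Shouty:
    """Dict-like wrapper that announces its subscript operations."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        print(f"getting {key!r}")
        return self._data[key]

    def __setitem__(self, key, value):
        print(f"setting {key!r}")
        self._data[key] = value

    def __delitem__(self, key):
        print(f"deleting {key!r}")
        del self._data[key]

foo = Shouty()
foo["bar"] = 1   # __setitem__
foo["bar"]       # __getitem__
del foo["bar"]   # __delitem__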

a foolish pianist
May 6, 2007

(bi)cyclic mutation

Data Graham posted:

Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"}

I think it's a useful analogy with array indices.

QuarkJets
Sep 8, 2008

Data Graham posted:

Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"}

Those characters do opposite things, nominally {} creates dict entries and [] fetches them. So the real wonkiness is in the fact that foo["bar"] suddenly becomes a setter if you put it on the left side of an equal sign, when it's normally a getter in other contexts

Macichne Leainig
Jul 26, 2012

by VG

Nippashish posted:

Tracking stupid statistics about data flowing through your pipeline is surprisingly useful for catching mistakes like this. The mean pixel value entering the network is pretty opaque on its own, but is immediately suspicious when it's different between training and prediction.

Yeah, I definitely get that. With NNs it's pretty easy to see the effects of garbage in, garbage out. Thankfully I've been slowly adding more robustness to the pipeline in general, so it helps normalize a lot of that stuff.

If anyone else ever deals with image segmentation, this library is super powerful, in my own limited experience.

https://github.com/qubvel/segmentation_models

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Protocol7 posted:

Yeah, I definitely get that. With NNs it's pretty easy to see the effects of garbage in, garbage out. Thankfully I've been slowly adding more robustness to the pipeline in general, so it helps normalize a lot of that stuff.

If anyone else ever deals with image segmentation, this library is super powerful, in my own limited experience.

https://github.com/qubvel/segmentation_models

You'll likely enjoy this part of the fast.ai deep learning course as they cover image localization/segmentation:

https://www.youtube.com/watch?v=MpZxV6DVsmM

Bonus that you can compare it to segmentation_models for output, ease of use, etc. It's PyTorch-based rather than TF and is also pretty easy to use.

Spaced God
Feb 8, 2014

All torment, trouble, wonder and amazement
Inhabits here: some heavenly power guide us
Out of this fearful country!



Can someone help point me in the direction of some good resources on using SQL in Python, or alternatively give some advice?

My internship's big project for me involves a ~60x200,000 Excel spreadsheet that requires very specific, weird analysis that no one has really wrapped their head around how to do, specifically involving simultaneous spatial and temporal analysis between different entries, so they tossed it to me. Originally I tried loving with it in pandas but I'm running out of ways to break it over my knee in a way that gets what I want out of it, so I thought about turning to SQL or some other relational DB. My background is in GIS and I've more or less been learning Python and pandas on the fly.

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?
From what you’re describing it doesn’t sound like putting the data into a SQL database is going to help you. It might be a better option for storing/retrieving the data in future (though you should think long and hard before creating a table with 200,000 columns), but if you’ve already got the data in a pandas data frame and don’t know what to do next it is not going to help.

Some more detail would be useful. What is the structure of your data and what analysis are you trying to run on it?

QuarkJets
Sep 8, 2008

Yeah SQL isn't going to help you find new data analysis solutions, Pandas tables basically replicate what you can produce with SQL tables anyway

Spaced God
Feb 8, 2014

All torment, trouble, wonder and amazement
Inhabits here: some heavenly power guide us
Out of this fearful country!



DoctorTristan posted:

From what you’re describing it doesn’t sound like putting the data into a SQL database is going to help you. It might be a better option for storing/retrieving the data in future (though you should think long and hard before creating a table with 200,000 columns), but if you’ve already got the data in a pandas data frame and don’t know what to do next it is not going to help.

Some more detail would be useful. What is the structure of your data and what analysis are you trying to run on it?
I'll try to explain in vague terms while hopefully giving the scope of the problem. I don't want anyone knowing a comedy web forum helped me be a good intern lol

I work with people who get called out from central areas all over a city to various places. If every person in one central area's territory is busy, another person from a nearby territory responds to any new calls in that saturated area. It gets more complex because sometimes multiple people get called out to a site, as well as a few other variables. We have a big business and the spreadsheet is automatically generated from the dispatching system with a ton of data including dates and times and locations of where the person is going, and that's the data source.

Currently what my code does is take the sheet (which I've done some manual excel magic on to help the process, more on that later) and put it into a pandas dataframe, do some joins to gather all the data of which person belongs where into one df, and then query every instance of where a person went to a place out of their territory, dependent on a few extra variables, via an embarrassingly long conditional query. It's a very vague query and not entirely representative, and that shows because it includes like 60% of the whole table. I'm trying to find a way to figure out, for example, whether every available person was already occupied when this happens, determined via timestamp analysis, but I have no idea how that would work in pandas.

Likewise, I've been trying to find a way to help sanitize or validate data conditionally, too. Ideally I'd run a for loop through pandas to look at our in-house codes for each territory on a row-by-row basis and strip useful info from them (they're six-character codes, with each pair representing something about the territory in decreasing scale, i.e. city, neighborhood, street). That stuff I've mostly been doing in excel manually, which sucks but works, but ideally I want that to be automated.

Hopefully that explains what I'm doing? Monday mornings are never my strong point

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Spaced God posted:

I'll try to explain in vague terms while hopefully giving the scope of the problem. I don't want anyone knowing a comedy web forum helped me be a good intern lol

I work with people who get called out from central areas all over a city to various places. If every person in one central area's territory is busy, another person from a nearby territory responds to any new calls in that saturated area. It gets more complex because sometimes multiple people get called out to a site, as well as a few other variables. We have a big business and the spreadsheet is automatically generated from the dispatching system with a ton of data including dates and times and locations of where the person is going, and that's the data source.

Currently what my code does is take the sheet (which I've done some manual excel magic on to help the process, more on that later) and put it into a pandas dataframe, do some joins to gather all the data of which person belongs where into one df, and then query every instance of where a person went to a place out of their territory, dependent on a few extra variables, via an embarrassingly long conditional query. It's a very vague query and not entirely representative, and that shows because it includes like 60% of the whole table. I'm trying to find a way to figure out, for example, whether every available person was already occupied when this happens, determined via timestamp analysis, but I have no idea how that would work in pandas.

Likewise, I've been trying to find a way to help sanitize or validate data conditionally, too. Ideally I'd run a for loop through pandas to look at our in-house codes for each territory on a row-by-row basis and strip useful info from them (they're six-character codes, with each pair representing something about the territory in decreasing scale, i.e. city, neighborhood, street). That stuff I've mostly been doing in excel manually, which sucks but works, but ideally I want that to be automated.

Hopefully that explains what I'm doing? Monday mornings are never my strong point

I can only give vague advice on this kind of vague info, but it sounds like you are trying to do something quite difficult and have a moderate-to-severe case of running before you can walk.

A few things stand out:

When you say that your data is 60*200000, did you get that the wrong way round or do you really have 60 rows and 200,000 columns?

What does each row in your data frame represent? Is it an event where someone is called out, with date/time/location?

Can you do more basic queries, such as ‘how many call outs were there on June 19 2019?’ Or ‘how many call outs were there in total on region X during 2018?’ Can you do visualisations of these?

It sounds like your extraction, data cleaning and analysis steps are getting quite mixed up with each other - this is invariably a bad idea.
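
One concrete tip for the territory-code splitting, at least: you shouldn't need a row-by-row loop, since pandas string slicing is vectorised. Roughly (the column name here is made up):

code:
# assumes a 'territory_code' column of six-character strings like "014223"
codes = df['territory_code'].astype(str)
df['city_code'] = codes.str[0:2]
df['neighborhood_code'] = codes.str[2:4]
df['street_code'] = codes.str[4:6]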
