it is
Aug 19, 2011

by Smythe

Bob Morales posted:

What the gently caress is a 'factory'? I know the Java joke is ObjectFactoryFactoryBeanFactory() but what is it?

A factory is just a class that lets you make instances of other classes without having to figure out which constructor to use yourself.

Let's say there's an interface called Shape, and it's got the methods draw() and getArea(). You may write a ShapeFactory class that returns an object of a different implementing class depending on what arguments you give its getShape() method.

Say you have the method ShapeFactory.getShape(String shapeName, int size). When shapeName is "CIRCLE" it may return a Circle object with radius size, but when shapeName is "SQUARE" it may return a Square object with side length size. They'll have different draw() methods and different getArea() methods but they can both be generated by ShapeFactory.getShape().

So you do all the work in the beginning to figure out what lists of parameters lead to which constructors, and then you never worry about it again.
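A minimal sketch of that getShape() idea, in case code is easier to follow than prose. Only Shape, draw(), getArea(), Circle, Square and ShapeFactory.getShape() come from the description above; the class bodies, the exception for unknown names, and ShapeFactoryDemo are my own assumptions:

```java
interface Shape {
    void draw();
    double getArea();
}

final class Circle implements Shape {
    private final int radius;
    Circle(int radius) { this.radius = radius; }
    @Override public void draw() { System.out.println("Circle with radius " + radius); }
    @Override public double getArea() { return Math.PI * radius * radius; }
}

final class Square implements Shape {
    private final int side;
    Square(int side) { this.side = side; }
    @Override public void draw() { System.out.println("Square with side " + side); }
    @Override public double getArea() { return (double) side * side; }
}

final class ShapeFactory {
    private ShapeFactory() {}

    // Picks the implementing class from the arguments so the caller doesn't have to.
    static Shape getShape(String shapeName, int size) {
        switch (shapeName) {
            case "CIRCLE": return new Circle(size);
            case "SQUARE": return new Square(size);
            default: throw new IllegalArgumentException("Unknown shape: " + shapeName);
        }
    }
}

// Hypothetical demo class, only here so the sketch can be run.
class ShapeFactoryDemo {
    public static void main(String[] args) {
        Shape circle = ShapeFactory.getShape("CIRCLE", 2);
        Shape square = ShapeFactory.getShape("SQUARE", 3);
        circle.draw();
        square.draw();
        System.out.println(circle.getArea() + " " + square.getArea());
    }
}
```

The caller only ever sees Shape; which constructor ran is the factory's problem.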

If you still don't get it maybe read this:
http://www.tutorialspoint.com/design_pattern/factory_pattern.htm


ExcessBLarg!
Sep 1, 2001

HappyHippo posted:

This is a classic example of Java forcing you to "noun your verbs"
That's an oversimplification, and not quite right either.

Java has static methods. So if you need a special method to create a Foo object a certain way, you can add it as a static method to the Foo class itself. Sure, "Foo" is a noun, but it's also a namespace. So that much is like any modern language.

The problem in Java (prior to Java 8) is that it didn't allow static methods in interfaces. So, if "Foo" is an interface with multiple implementations, you couldn't add static methods to create different flavors of Foo objects. The traditional solution was to have a separate class, FooFactory, to store the different factory methods for creating Foo objects. Sometimes the factory class name doesn't actually contain "Factory", like how the Collections class contains a bunch of methods for creating different kinds of Collection objects.

Java 8 allows for static methods in interfaces now, so Java no longer forces you to put factory methods in a separate class, but most APIs will probably continue to do so for Java 6/7 compatibility.
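A rough sketch of what that looks like on Java 8 or later. Foo is the interface from the text above; describe(), plain(), fancy(), PlainFoo, FancyFoo and FooDemo are made-up names for illustration only:

```java
interface Foo {
    String describe();

    // Java 8+: factory methods can live on the interface itself,
    // so no separate FooFactory class is needed.
    static Foo plain() { return new PlainFoo(); }
    static Foo fancy() { return new FancyFoo(); }
}

// Hypothetical implementations; callers never need to name them.
final class PlainFoo implements Foo {
    @Override public String describe() { return "plain foo"; }
}

final class FancyFoo implements Foo {
    @Override public String describe() { return "fancy foo"; }
}

// Hypothetical demo class, only here so the sketch can be run.
class FooDemo {
    public static void main(String[] args) {
        System.out.println(Foo.plain().describe());
        System.out.println(Foo.fancy().describe());
    }
}
```

On Java 7 and earlier the two static methods would have to move into a separate class, which is where FooFactory-style names come from.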

HappyHippo posted:

(you can't just build something, you need to get a builder and then tell it to build).
You can "just build" something. However, since Java lacks keyword arguments, the Builder pattern avoids having to have constructors with many arguments, or having to pass in a "config" object. It's solving a different problem than the lack of global-scope methods.

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?

ExcessBLarg! posted:

That's an oversimplification, and not quite right either.

Java has static methods. So if you need a special method to create a Foo object a certain way, you can add it as a static method to the Foo class itself. Sure, "Foo" is a noun, but it's also a namespace. So that much is like any modern language.

The problem in Java (prior to Java 8) is that it didn't allow static methods in interfaces. So, if "Foo" is an interface with multiple implementations, you couldn't add static methods to create different flavors of Foo objects. The traditional solution was to have a separate class, FooFactory, to store the different factory methods for creating Foo objects. Sometimes the factory class name doesn't actually contain "Factory", like how the Collections class contains a bunch of methods for creating different kinds of Collection objects.

Java 8 allows for static methods in interfaces now, so Java no longer forces you to put factory methods in a separate class, but most APIs will probably continue to do so for Java 6/7 compatibility.

You can "just build" something. However, since Java lacks keyword arguments, the Builder pattern avoids having to have constructors with many arguments, or having to pass in a "config" object. It's solving a different problem than the lack of global-scope methods.

Yes I was being a little facetious there. However a lot of tutorials on Design Patterns™ skip over little details like "do you even need this?" and people tend to cargo cult the poo poo out of them so it's important to have a critical eye. Case in point:


Ok first of all let's look at the factory method itself:
code:
 public Shape getShape(String shapeType){
      if(shapeType == null){
         return null;
      }		
      if(shapeType.equalsIgnoreCase("CIRCLE")){
         return new Circle();
         
      } else if(shapeType.equalsIgnoreCase("RECTANGLE")){
         return new Rectangle();
         
      } else if(shapeType.equalsIgnoreCase("SQUARE")){
         return new Square();
      }
      
      return null;
   }
Does this really do anything you couldn't have done with the constructors themselves? Not really. What's worse is that by passing a string you bypass the type system and introduce a new runtime failure mode (the function doesn't even throw an exception, it just returns null). You could solve this by defining an Enum at the cost of more code bloat as well as adding yet another point where the code must change whenever you introduce a new Shape. But I digress, let's look at the factory class itself:

code:
public class ShapeFactory {
	
   //use getShape method to get object of type shape 
   public Shape getShape(String shapeType){
      ...
   }
}
Notice how there is no state whatsoever in the ShapeFactory. getShape ought to be a static method. At some point I guess there could be some configuration in the ShapeFactory which would play a role but the tutorial doesn't discuss any of this. As a result you now need to create a pointless instance whenever you want to make a shape:
code:
public static void main(String[] args) {
      ShapeFactory shapeFactory = new ShapeFactory();

      //get an object of Circle and call its draw method.
      Shape shape1 = shapeFactory.getShape("CIRCLE");

      //call draw method of Circle
      shape1.draw();

	...
}
And all you wanted was a circle!
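For comparison, a sketch of the two changes mentioned above: an enum instead of a string, and a static method instead of a throwaway factory instance. ShapeType, Shapes, create() and ShapesDemo are names I made up, and the shape classes are cut down to a bare draw():

```java
enum ShapeType { CIRCLE, RECTANGLE, SQUARE }

interface Shape {
    void draw();
}

class Circle implements Shape {
    @Override public void draw() { System.out.println("circle"); }
}

class Rectangle implements Shape {
    @Override public void draw() { System.out.println("rectangle"); }
}

class Square implements Shape {
    @Override public void draw() { System.out.println("square"); }
}

final class Shapes {
    private Shapes() {}

    // Static: there is no state, so no instance is needed.
    // Enum argument: a misspelled shape name is now a compile error, not a null.
    static Shape create(ShapeType type) {
        switch (type) {
            case CIRCLE:    return new Circle();
            case RECTANGLE: return new Rectangle();
            case SQUARE:    return new Square();
            default:        throw new IllegalArgumentException("Unhandled shape type: " + type);
        }
    }
}

// Hypothetical demo class, only here so the sketch can be run.
class ShapesDemo {
    public static void main(String[] args) {
        Shapes.create(ShapeType.CIRCLE).draw();
        new Circle().draw(); // or skip the factory entirely
    }
}
```

It still has the cost already pointed out: the enum and the switch are two more places to touch whenever a new Shape shows up.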

TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe

ExcessBLarg! posted:

You can "just build" something. However, since Java lacks keyword arguments, the Builder pattern avoids having to have constructors with many arguments, or having to pass in a "config" object. It's solving a different problem than the lack of global-scope methods.

Builders also make it a lot more pleasant to create large numbers of similar objects, especially if they're immutable and can thus share internal state. You fill in the bulk of the builder with the stuff that's identical across all objects you're creating, then you tweak the few things that are different, create an object, tweak again, create another object, etc.
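Something like this, as a sketch. Widget and all of its fields are invented for the example; the point is only the fill-in-once, tweak, build, tweak, build rhythm:

```java
final class Widget {
    private final String label;
    private final int width;
    private final int height;
    private final String color;

    private Widget(Builder b) {
        this.label = b.label;
        this.width = b.width;
        this.height = b.height;
        this.color = b.color;
    }

    String label() { return label; }
    int width() { return width; }
    int height() { return height; }
    String color() { return color; }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        private String label = "";
        private int width;
        private int height;
        private String color = "black";

        Builder label(String value) { this.label = value; return this; }
        Builder width(int value) { this.width = value; return this; }
        Builder height(int value) { this.height = value; return this; }
        Builder color(String value) { this.color = value; return this; }

        // Each call snapshots the builder's current state into a new immutable Widget.
        Widget build() { return new Widget(this); }
    }
}

// Hypothetical demo class, only here so the sketch can be run.
class WidgetDemo {
    public static void main(String[] args) {
        // Fill in the parts that are identical across all the objects once...
        Widget.Builder shared = Widget.builder().width(80).height(24).color("blue");
        // ...then tweak the one thing that differs and build, repeatedly.
        Widget ok = shared.label("OK").build();
        Widget cancel = shared.label("Cancel").build();
        System.out.println(ok.label() + " / " + cancel.label());
    }
}
```

Because Widget is immutable, changing the builder after build() can't affect objects that were already created.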

Gravity Pike
Feb 8, 2009

I find this discussion incredibly bland and disinteresting.

Bob Morales posted:

What the gently caress is a 'factory'? I know the Java joke is ObjectFactoryFactoryBeanFactory() but what is it?

An ObjectFactoryFactoryBeanFactory is obviously a class that has a method that produces an object with simple get/set methods that has a member that is a class that has a method that produces an object of a class that has a method that produces an object of class Object. I mean, it's right there in the name. :jerkbag:

The MUMPSorceress
Jan 6, 2012


^SHTPSTS

Gary’s Answer
I feel like I post this link a lot, but http://java.metagno.me/

A lot of the real horrorshow classes that give the builder paradigm a bad name come from Spring.

Tea Bone
Feb 18, 2011

I'm going for gasps.
EDIT: Never mind I was being an idiot and trying to use a Ruby Controller when I should have been using the model.

Tea Bone fucked around with this message at 14:25 on Mar 24, 2016

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

This is kind of a vague and open-ended question...

I'm trying to think of any actual or hypothetical cool/interesting/novel uses of machine learning in line of business applications and I can't think of anything. I'm thinking like...something a store owner would use to manage inventory, a property owner would use to manage their property, a trucking company would use to manage their fleet, etc.

Anything come to anyone's mind?

Linear Zoetrope
Nov 28, 2011

A hero must cook

Thermopyle posted:

This is kind of a vague and open-ended question...

I'm trying to think of any actual or hypothetical cool/interesting/novel uses of machine learning in line of business applications and I can't think of anything. I'm thinking like...something a store owner would use to manage inventory, a property owner would use to manage their property, a trucking company would use to manage their fleet, etc.

Anything come to anyone's mind?

The problem you're running into is probably that most of these tasks either don't have the data to train a good model, exact methods exist, good approximate methods exist, or they don't want to deal with something that spits out an answer that could be wrong and you have no idea how often it's wrong, how confident the prediction is, or why it made that guess. Most of the stuff would probably just be predictive recommendations, like "hey, your store sold a lot of Barbies, maybe you'd like to stock Monster High dolls too next season?" Or it just finds obvious answers. Like, my university once spent a few terms on a research project with the local fire department trying to optimize their dispatch routes, only to discover they didn't need an entire fire station. The fire chief knew this, it was only built because they got a grant on the insistence they build a new fire station. And the predictions were basically what their two or three dispatchers were doing already anyway, so the automation didn't provide any extra insight.

Personally I'm working on some research in the field of confident predictions -- an agent that can recognize when it has no earthly idea what it's doing and ask for help from an expert. The end goal is to make something that can safely automate things like power plants by taking the burden off human controllers, but not just go off the rails and send the core into meltdown when it's not sure what to do. But it's pretty nascent technology.

Edit: There's been some talk in my department about trying to develop agents for things like police or construction training simulations, it's being considered semi-seriously so at least some smart people think it's feasible, but there's not much work on it.

Linear Zoetrope fucked around with this message at 14:58 on Mar 24, 2016

Skandranon
Sep 6, 2008
fucking stupid, dont listen to me

Thermopyle posted:

This is kind of a vague and open-ended question...

I'm trying to think of any actual or hypothetical cool/interesting/novel uses of machine learning in line of business applications and I can't think of anything. I'm thinking like...something a store owner would use to manage inventory, a property owner would use to manage their property, a trucking company would use to manage their fleet, etc.

Anything come to anyone's mind?

I think you need to first figure out if you could ever convince the business user that an AI would be helpful, and if you could ever recoup costs. Watson is being used for medical diagnosis and offering advice for lawyers, but that's really just serving as a specialized version of Google. Machine learning still requires some scale to be worthwhile.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Web-based line of business apps could have some sort of scale, but maybe not in a way that's useful. Imagine a truckers.com provides something where fleet managers can sign up for an account and input data about their trucks and truckers and the truckers can input data about their trips on the fly and devices attached to the trucks upload data...you could have some sort of scale across all of the accounts. Of course you have privacy/security implications to think about.

Anyway, I'm not actually proposing any solution, just wondering if anyone has done it or thought of doing it.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



I was thinking you could essentially print money if you could make a system that could learn to reliably transform business data. There's already tons of money in that space, so if you could undercut Data Transform Consultants Local Number 5032 by 10% with the kind of margins that you could get by spinning up instances on AWS/Azure on demand you'd probably be raking in billions in under five years. I kind of doubt it's possible to do with ML in its modern state, but I'm no expert.

Skandranon
Sep 6, 2008
fucking stupid, dont listen to me

Munkeymon posted:

I was thinking you could essentially print money if you could make a system that could learn to reliably transform business data. There's already tons of money in that space, so if you could undercut Data Transform Consultants Local Number 5032 by 10% with the kind of margins that you could get by spinning up instances on AWS/Azure on demand you'd probably be raking in billions in under five years. I kind of doubt it's possible to do with ML in its modern state, but I'm no expert.

Most of the difficulty of data transformation stuff ends up being either accommodating dumb business demands (has to always be upper case and have three trailing spaces), or actually understanding the rat's nest people have gotten themselves into. The rules creation part is usually rather easy; it's the people surrounding things that make it so difficult.

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

Munkeymon posted:

I was thinking you could essentially print money if you could make a system that could learn to reliably transform business data. There's already tons of money in that space, so if you could undercut Data Transform Consultants Local Number 5032 by 10% with the kind of margins that you could get by spinning up instances on AWS/Azure on demand you'd probably be raking in billions in under five years. I kind of doubt it's possible to do with ML in its modern state, but I'm no expert.

You're basically talking about every ETL platform plus maybe Drools.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Unrelated to my brain droppings above, I did a stupid in WebStorm. I opened a .dust file and dismissed some message that popped up telling me the highlighting was being overridden by some setting thinking I'd just install a plugin and it wouldn't matter. Well, I installed the Dust plugin and I'm still not getting highlighting, so clearly I need to (un)tick a box somewhere but I don't see where. Does this sound familiar to anyone?

Blinkz0rz posted:

You're basically talking about every ETL platform plus maybe Drools.

But you still have to have a well-paid expert understand the rules well enough to describe the transforms in some formal language. I was thinking about using some before/after examples as training data and having the system derive the transform rules on its own. Again, it seems unlikely to be feasible (possible?) for the foreseeable future, especially given how weird edge cases crop up that would probably gently caress up the training. Maybe an intermediate step could be software that figures out some of the rules and auto-gens some code for an expert to start from to save some skilled labor, though.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


Thermopyle posted:

This is kind of a vague and open-ended question...

I'm trying to think of any actual or hypothetical cool/interesting/novel uses of machine learning in line of business applications and I can't think of anything. I'm thinking like...something a store owner would use to manage inventory, a property owner would use to manage their property, a trucking company would use to manage their fleet, etc.

Anything come to anyone's mind?

I spent a couple years doing quantitative retail for a large company that you've heard of. It's not entirely clear what's machine learning and what's statistics, stochastic optimization or stochastic control, but there are a lot of fairly advanced methods being used in getting sweaters from a warehouse to you. The major areas where quants work that I can think of off the top of my head are sales forecasting, inventory management, shipping, pricing, competitor price discovery, ad placement, search and recommendations. There are bodies of literature describing how to solve any of those problems, so if you're looking to do something novel, you definitely have to argue that what you have in mind is better than the standard solutions.

Nobody ever really thought about applying machine learning to ETL. The problem there, in my experience, is that 80% of the data is regular enough to be described by simple rules, and the remaining 20% are a bunch of weird edge cases that you have to get right. That's really not the sort of scenario where an ML system is going to look good in terms of ROI.

Thermopyle posted:

Web-based line of business apps could have some sort of scale, but maybe not in a way that's useful. Imagine a truckers.com provides something where fleet managers can sign up for an account and input data about their trucks and truckers and the truckers can input data about their trips on the fly and devices attached to the trucks upload data...you could have some sort of scale across all of the accounts. Of course you have privacy/security implications to think about.

Anyway, I'm not actually proposing any solution, just wondering if anyone has done it or thought of doing it.

I'm not really sure how much of a market there is in providing services like what I did to smaller businesses. It'd have to be pretty cheap because they don't have the sort of problems where a reasonable solution is that much worse than the optimal solution, and the people who can do these sorts of models tend to be somewhat expensive. On the other hand, there are companies trying to do third party analytics, so maybe it can work.

Marx Headroom
May 10, 2007

AT LAST! A show with nonono commercials!
Fallen Rib
What is the difference between casting an object and calling a decorator function on an object? I had a dream where someone explained this to me and something clicked but I just woke up and forgot their argument. It had something to do with arguments and value vs reference.

Marx Headroom fucked around with this message at 15:10 on Mar 26, 2016

Pie Colony
Dec 8, 2006
I AM SUCH A FUCKUP THAT I CAN'T EVEN POST IN AN E/N THREAD I STARTED

Mr. Jive posted:

What is the difference between casting an object and calling a decorator function on an object? I had a dream where someone explained this to me and something clicked but I just woke up and forgot their argument. It had something to do with arguments and value vs reference.

Casting an object is re-interpreting some location in memory as a different type. You usually explicitly cast things when going from a less specific (superclass) to a more specific (subclass) thing. This more specific thing usually has something "additional" you want to be able to use. If you squint hard enough, this is kinda similar to decorators, which accept some thing (an object, a function) and usually add some functionality to it.

BigRedDot
Mar 6, 2008

Gravity Pike posted:

An ObjectFactoryFactoryBeanFactory is obviously a class that has a method that produces an object with simple get/set methods that has a member that is a class that has a method that produces an object of a class that has a method that produces an object of class Object. I mean, it's right there in the name. :jerkbag:

Just reminded of the old joke:

quote:

I had a problem and used Java.

Now I have a ProblemFactory

Shaocaholica
Oct 29, 2002

Fig. 5E
Not sure if this is the best place to ask but is BitTorrent resilient to any kind and magnitude of data corruption? For instance, if I have a completed torrent on disk and I randomly flip, delete and add bits all over the place, is there enough error correction data in the torrent to be able to correct all of those data alterations? Of course there's a point where the data will no longer resemble the original data at all but still, will BT just clobber the entire bad file or will it just throw its hands up?

Skandranon
Sep 6, 2008
fucking stupid, dont listen to me

Shaocaholica posted:

Not sure if this is the best place to ask but is BitTorrent resilient to any kind and magnitude of data corruption? For instance, if I have a completed torrent on disk and I randomly flip, delete and add bits all over the place, is there enough error correction data in the torrent to be able to correct all of those data alterations? Of course there's a point where the data will no longer resemble the original data at all but still, will BT just clobber the entire bad file or will it just throw its hands up?

I don't think the torrent itself has any error correction built in, but it will be able to detect that a given block's hash is no longer good and then download that block again.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed

Skandranon posted:

I don't think the torrent itself has any error correction built in, but it will be able to detect that a given block's hash is no longer good and then download that block again.

Yeah, the torrent file itself is basically just a list of filenames and metadata plus a checksum for each block of data. Barring hash collisions or absurd luck, anywhere between a single bit flipped in each block and a file of entirely random data will just end up resulting in the entire file being redownloaded from peers.
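A sketch of that per-block check, to make the "detect, then redownload" behaviour concrete. It assumes SHA-1 over fixed-size blocks; PieceChecker, pieceHashes(), badPieces() and the tiny block size in the demo are made up for illustration, and a real client also compares file lengths, which this skips:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PieceChecker {
    private PieceChecker() {}

    // SHA-1 of data[from, to).
    static byte[] sha1(byte[] data, int from, int to) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(data, from, to - from);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // What the torrent would carry: one hash per fixed-size block of the original data.
    static List<byte[]> pieceHashes(byte[] data, int pieceLength) {
        List<byte[]> hashes = new ArrayList<>();
        for (int off = 0; off < data.length; off += pieceLength) {
            hashes.add(sha1(data, off, Math.min(off + pieceLength, data.length)));
        }
        return hashes;
    }

    // Indices of blocks whose on-disk bytes no longer match the expected hash.
    // There is no error correction here: a bad block can only be fetched again.
    static List<Integer> badPieces(byte[] onDisk, List<byte[]> expected, int pieceLength) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < expected.size(); i++) {
            int from = i * pieceLength;
            boolean present = from < onDisk.length;
            int to = Math.min(from + pieceLength, onDisk.length);
            if (!present || !Arrays.equals(sha1(onDisk, from, to), expected.get(i))) {
                bad.add(i);
            }
        }
        return bad;
    }
}

// Hypothetical demo class, only here so the sketch can be run.
class PieceCheckerDemo {
    public static void main(String[] args) {
        byte[] original = new byte[10];
        for (int i = 0; i < original.length; i++) original[i] = (byte) i;
        List<byte[]> hashes = PieceChecker.pieceHashes(original, 4);

        byte[] corrupt = original.clone();
        corrupt[5] ^= 1; // flip one bit inside the second block
        System.out.println("blocks to redownload: " + PieceChecker.badPieces(corrupt, hashes, 4));
    }
}
```

One flipped bit marks only the block containing it; flip a bit in every block and every block comes back bad, which is the "entire file redownloaded" case.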

Shaocaholica
Oct 29, 2002

Fig. 5E

Plorkyeran posted:

Yeah, the torrent file itself is basically just a list of filenames and metadata plus a checksum for each block of data. Barring hash collisions or absurd luck, anywhere between a single bit flipped in each block and a file of entirely random data will just end up resulting in the entire file being redownloaded from peers.


Skandranon posted:

I don't think the torrent itself has any error correction built in, but it will be able to detect that a given block's hash is no longer good and then download that block again.

Thanks. What if the file size is different which would result in blocks being offset? I guess that would just fall into the 'all blocks bad, get all blocks'? I guess that's really down to the fine details of how BT checks blocks.

Skandranon
Sep 6, 2008
fucking stupid, dont listen to me

Shaocaholica posted:

Thanks. What if the file size is different which would result in blocks being offset? I guess that would just fall into the 'all blocks bad, get all blocks'? I guess that's really down to the fine details of how BT checks blocks.

The torrent contains all the file metadata as well, so if the file size/name changed, then it would recognize it's not right, and a client would either throw an error or start getting all blocks again.

hooah
Feb 6, 2006
WTF?
I'm trying to implement the Newton-Raphson iteratively-reweighted least squares method of logistic regression in Matlab for homework. However, I'm finding that I'm encountering a lot of matrices which are "singular to working precision". The update function does this:


(For ease of typing I'll call Phi in the above image X.) I calculate y, R, and z, then plug everything in to the second line in that image. I've narrowed down the problematic matrix to X' * R * X. Is there some other way I should be doing this?

Linear Zoetrope
Nov 28, 2011

A hero must cook

hooah posted:

I'm trying to implement the Newton-Raphson iteratively-reweighted least squares method of logistic regression in Matlab for homework. However, I'm finding that I'm encountering a lot of matrices which are "singular to working precision". The update function does this:


(For ease of typing I'll call Phi in the above image X.) I calculate y, R, and z, then plug everything in to the second line in that image. I've narrowed down the problematic matrix to X' * R * X. Is there some other way I should be doing this?

Not unless X and R are square and invertible, in which case you can compute inv(X)*inv(R)*inv(X'), but X usually isn't square, so I don't think that's it. The usual culprits in logistic regression tend to be either a subtly incorrect logistic function, or not normalizing the features. Also make sure you don't have a flipped sign somewhere, that's bitten me before. It could also be that you're not detecting convergence quickly enough/doing too many iterations. What does the result look like if you just escape when you get the error and use the trained function? Does it have reasonable accuracy?

Linear Zoetrope fucked around with this message at 19:29 on Mar 28, 2016

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Given a log file with two rows like this:

code:
[line A] 2015-07-10T12:33:40.799310Z test-rainwalk-net 91.200.12.138:65358 10.10.10.24:80 0.00005 0.00586 0.000057 404 404 0 1245 "GET http://www.rainwalk.net/index.php/page/8/ HTTP/1.1" "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0\"" - -
[line B] 2015-07-10T12:35:17.709088Z test-rainwalk-net 207.46.13.72:5393 10.10.11.161:80 0.000045 0.858576 0.000063 200 200 0 7605 "GET http://www.rainwalk.net:80/index.php/page/8/ HTTP/1.1" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" - -
take note of the following sections where the browser agent is specified.

[line A] "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0\""
[line B] "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

(I added the [line A] and [line B] tags; they are not part of the log file.)

I am trying to import each line into a database table where double quotes indicate the start and end of a text string, and the weird (\ ") shape at the front and back of the string in [line A] is breaking the import process. There are other patterns in the thousands of log files I am trying to import involving nested double quotes that break my text-identified strings.

I would like to identify a regex string that will eliminate the nested double quotes from within the text string, without stripping the double quotes in a "good" string. Or is there a better way of making the strings consistent?

nielsm
Jun 1, 2009



Agrikk posted:

I would like to identify a regex string that will eliminate the nested double quotes from within the text string, without stripping the double quotes in a "good" string. Or is there a better way of making the strings consistent?

If you have a CSV parsing library available, and you can configure all delimiters it uses, try telling it to use space as field delimiter, and double quotes with backslash escapes for quoted strings. That should get you correct handling.
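If no such library is handy, the same rules are small enough to sketch by hand: space as the delimiter, double quotes around strings, backslash as the escape inside quotes. LogLineSplitter and split() are made-up names, and treating a run of spaces as a single separator is my assumption about the log format:

```java
import java.util.ArrayList;
import java.util.List;

final class LogLineSplitter {
    private LogLineSplitter() {}

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean fieldStarted = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes && c == '\\' && i + 1 < line.length()) {
                // Backslash escape inside quotes: keep the next character literally.
                current.append(line.charAt(++i));
            } else if (c == '"') {
                // Unescaped quote: opens or closes a quoted string.
                inQuotes = !inQuotes;
                fieldStarted = true;
            } else if (c == ' ' && !inQuotes) {
                // Space outside quotes ends the current field.
                if (fieldStarted) {
                    fields.add(current.toString());
                    current.setLength(0);
                    fieldStarted = false;
                }
            } else {
                current.append(c);
                fieldStarted = true;
            }
        }
        if (fieldStarted) {
            fields.add(current.toString());
        }
        return fields;
    }
}

// Hypothetical demo class, only here so the sketch can be run.
class LogLineSplitterDemo {
    public static void main(String[] args) {
        String line = "404 0 \"GET http://www.rainwalk.net/index.php/page/8/ HTTP/1.1\" "
                + "\"\\\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0\\\"\" - -";
        for (String field : LogLineSplitter.split(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

The quoted user agent stays one field no matter how many words it contains, and the escaped inner quotes come through as ordinary characters that can be stripped or kept afterwards.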

TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe

Agrikk posted:

I would like to identify a regex string that will eliminate the nested double quotes from within the text string, without stripping the double quotes in a "good" string. Or is there a better way of making the strings consistent?

Jesus Christ, no. You want to bind your parameters into a query, and let your query library handle escaping quotes, etc. for you. In general, if you are generating a query by inserting variables into strings, You Are What Is Wrong With Databases.

A proper query gen should look something like this:
code:
dbConn.execute("INSERT INTO MyTable VALUES (?, ?, ?, ?)", param1, param2, param3, param4)
Obviously varying depending on the specifics of the language and DB you're working with, but if your database doesn't support bound variables (I'm not even aware of any that lack this feature) then you need to ditch it and use something that does.

mystes
May 31, 2006

TooMuchAbstraction posted:

Jesus Christ, no. You want to bind your parameters into a query, and let your query library handle escaping quotes, etc. for you. In general, if you are generating a query by inserting variables into strings, You Are What Is Wrong With Databases.
I want to somehow believe the question was about parsing the string to insert different parts of it into different columns, but then the database part would have been irrelevant to the question, so I don't know what to think.

Lysidas
Jul 26, 2002

John Diefenbaker is a madman who thinks he's John Diefenbaker.
Pillbug

Your Φ isn't rank-deficient, is it? ΦT R Φ will be singular if so, and that's a simple case that's worth ruling out early.
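Spelled out, assuming the update in question is the usual Newton-Raphson/IRLS step for logistic regression with Φ an N×M design matrix, t the targets, y the current predictions and R diagonal (the notation beyond Φ, R, y and z is my assumption):

```latex
\begin{align*}
\mathbf{w}^{\text{new}}
  &= \mathbf{w}^{\text{old}} - (\Phi^{T} R \Phi)^{-1} \Phi^{T} (\mathbf{y} - \mathbf{t}) \\
  &= (\Phi^{T} R \Phi)^{-1} \Phi^{T} R \mathbf{z},
  \qquad \mathbf{z} = \Phi \mathbf{w}^{\text{old}} - R^{-1} (\mathbf{y} - \mathbf{t}), \\
R &= \operatorname{diag}\bigl(y_n (1 - y_n)\bigr), \\
\Phi^{T} R \Phi &= (R^{1/2} \Phi)^{T} (R^{1/2} \Phi)
  \;\Rightarrow\;
  \operatorname{rank}(\Phi^{T} R \Phi) = \operatorname{rank}(R^{1/2} \Phi) \le \operatorname{rank}(\Phi).
\end{align*}
```

So if rank(Φ) < M (duplicate or linearly dependent columns, or fewer samples than features), the M×M matrix can't be inverted. Even with a full-rank Φ, predictions saturating at 0 or 1 push the diagonal of R toward zero, which can make the product numerically singular as well.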

Lysidas fucked around with this message at 20:13 on Mar 28, 2016

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

TooMuchAbstraction posted:

Jesus Christ, no. You want to bind your parameters into a query, and let your query library handle escaping quotes, etc. for you. In general, if you are generating a query by inserting variables into strings, You Are What Is Wrong With Databases.

A proper query gen should look something like this:
code:
dbConn.execute("INSERT INTO MyTable VALUES (?, ?, ?, ?)", param1, param2, param3, param4)
Obviously varying depending on the specifics of the language and DB you're working with, but if your database doesn't support bound variables (I'm not even aware of any that lack this feature) then you need to ditch it and use something that does.

mystes posted:

I want to somehow believe the question was about parsing the string to insert different parts of it into different columns, but then the database part would have been irrelevant to the question, so I don't know what to think.

This is correct. SSIS requires that I have columns defined for the ETL job where the columns match up with the delimiters (in this case the delimiter is a space, and text strings are wrapped with double quotes). Because there exist nested double quotes, the ETL job is parsing columns incorrectly and the import fails.

All I'm looking for is to remove any nested double quote.

Agrikk fucked around with this message at 20:28 on Mar 28, 2016

nielsm
Jun 1, 2009



Agrikk posted:

This is correct. SSIS requires that I have columns defined for the ETL job where the columns match up with the delimiters (in this case the delimiter is a space, and text strings are wrapped with double quotes). Because there exist nested double quotes, the ETL job is parsing columns incorrectly and the import fails.

All I'm looking for is to remove any nested double quote.

As I wrote, any chance of handling it as if it was CSV, just with spaces instead of commas? Because then you should get the quote handling for free.

TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe
Okay, sorry for leaping to bad conclusions, but I heard "use a regex to fix my database input" and everything went red for a moment there.

So what exactly are you looking to do here? Just remove the quotation mark? If it's always preceded by a \ then that's pretty easy:
code:
cat log.txt | sed 's/\\"//g'
If you want something more complicated then you'll need to find a way to distinguish between the "good" quotation marks that delimit entries, and the "bad" ones that are within entries. That'll probably depend on the structure of your log file.

Do nested quotation marks only ever show up in the last "entry" on the line? Then you can do a greedy match up to the last quotation mark that precedes a "bad" mark, split the line into "good" and "bad" entries, and remove quotation marks from the "bad" section.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

nielsm posted:

As I wrote, any chance of handling it as if it was CSV, just with spaces instead of commas? Because then you should get the quote handling for free.

No, that won't work because the userAgent string can also be a random number of words that would then get broken up into a random number of columns.


TooMuchAbstraction posted:

Okay, sorry for leaping to bad conclusions, but I heard "use a regex to fix my database input" and everything went red for a moment there.

So what exactly are you looking to do here? Just remove the quotation mark? If it's always preceded by a \ then that's pretty easy:
code:
cat log.txt | sed 's/\\"//g'
If you want something more complicated then you'll need to find a way to distinguish between the "good" quotation marks that delimit entries, and the "bad" ones that are within entries. That'll probably depend on the structure of your log file.

Do nested quotation marks only ever show up in the last "entry" on the line? Then you can do a greedy match up to the last quotation mark that precedes a "bad" mark, split the line into "good" and "bad" entries, and remove quotation marks from the "bad" section.

:) No problem.

The problem is the random nature of the user agent string. It is wrapped in quotes to indicate it is a string, but it can have a number of patterns within it, including a varying number of words, embedded double and single quotes and whatnot, and it doesn't appear at the end of the line.

What I'm trying to avoid is having the ETL process bomb out and then me having to add another find/replace special case to a preprocessing step that filters out the weird double quote combinations.

hooah
Feb 6, 2006
WTF?

Lysidas posted:

Your Φ isn't rank-deficient, is it? Φ^T R Φ will be singular if so, and that's a simple case that's worth ruling out early.

Yeah, evidently it is. The original data is from a spam database which has 4,600 examples and 57 attributes and one class label. For whatever reason, the professor modified the data "by splitting above and below the mean count". This results in 114 attributes and 2 class labels (which don't seem to have any relation to the labels file he gave us, but whatever). I'm assuming the rank-deficiency is due to this meddling with the data, since the rank of the data matrix is only 58. I'll ask him tomorrow what we're supposed to do with that weird format. Thanks for helping me narrow it down.
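Here's a toy check of why a rank of 58 would make sense. This assumes "splitting above and below the mean" means each attribute is replaced by two 0/1 indicator columns (above the mean, not above the mean); that's my guess at the encoding, not something the professor confirmed. Under that assumption each pair of columns sums to the all-ones vector, so 114 columns can span at most 57 + 1 = 58 dimensions. The data values and the rank helper below are made up for the sketch.

```python
from fractions import Fraction


def rank(matrix):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rank_count = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        # Find a pivot row at or below the current rank position.
        pivot = next(
            (r for r in range(rank_count, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[rank_count], rows[pivot] = rows[pivot], rows[rank_count]
        # Clear this column in every other row.
        for r in range(len(rows)):
            if r != rank_count and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank_count][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank_count])]
        rank_count += 1
    return rank_count


# Toy data: 6 examples, 3 count attributes (values are made up).
data = [
    [0, 5, 1],
    [4, 6, 1],
    [1, 1, 9],
    [7, 2, 0],
    [0, 8, 3],
    [2, 0, 6],
]
n = len(data)
means = [sum(row[j] for row in data) / n for j in range(3)]

# Each attribute becomes two indicator columns: above mean, not above mean.
split = [
    [bit for j in range(3) for bit in (int(row[j] > means[j]), int(row[j] <= means[j]))]
    for row in data
]

print(len(split[0]), rank(split))
```

The toy matrix has 6 columns but only rank 4, i.e. 3 + 1, which is the same pattern as 114 columns with rank 58.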

nielsm
Jun 1, 2009



Agrikk posted:

No, that won't work because the userAgent string can also be a random number of words that would then get broken up into a random number of columns.

Really?

1. Replace \" sequences with "" sequences
2. Import as CSV with space as field delimiter

Works in Excel at least, and if you're using MS SSIS I'd assume it uses compatible CSV reader code. Other CSV readers may use different escaping rules for quote marks.

Yes I'm really set on this idea. If you can use a real parser, do so. The format seems to be written for CSV-style reading.
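To show what I mean, here's the two-step idea sketched with Python's csv module instead of SSIS. The log line is a hypothetical example of the format described above, and SSIS may not follow the same quoting rules as this reader.

```python
import csv
import io

# Hypothetical log line: space-delimited, strings wrapped in double quotes,
# nested quotes escaped as \" inside the user agent field.
raw = '10.0.0.1 200 "Mozilla/5.0 (X11; \\"odd\\" build) Gecko" "GET /index.html"\n'

# Step 1: rewrite \" as "" (the CSV convention for a literal quote).
fixed = raw.replace('\\"', '""')

# Step 2: read as CSV with a space as the field delimiter.
reader = csv.reader(io.StringIO(fixed), delimiter=" ", quotechar='"')
fields = next(reader)
print(fields)
```

The user agent comes out as a single field no matter how many words it contains, because the spaces inside the quotes are not treated as delimiters.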

nielsm fucked around with this message at 21:30 on Mar 28, 2016

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

nielsm posted:

Really?

1. Replace \" sequences with "" sequences
2. Import as CSV with space as field delimiter

Works in Excel at least, and if you're using MS SSIS I'd assume it uses compatible CSV reader code. Other CSV readers may use different escaping rules for quote marks.





Hrm...

Closer. But SSIS barfs on one of the quotes:



MS is notoriously bad at consistency between SQL Server, Excel, Access, etc. in terms of how they handle data. :argh:

Oh well, I think I'll just create a library of substitutions to make. :(

No Safe Word
Feb 26, 2005

TooMuchAbstraction posted:

code:
cat log.txt | sed 's/\\"//g'

You suffer from the same affliction I try to rid myself of: gratuitous use of cat (sed can take a filename as an arg)

Other commands I used to do that poo poo with regularly: less, grep, and their z* variants


TooMuchAbstraction
Oct 14, 2012

I spent four years making
Waves of Steel
Hell yes I'm going to turn my avatar into an ad for it.
Fun Shoe

No Safe Word posted:

You suffer from the same affliction I try to rid myself of: gratuitous use of cat (sed can take a filename as an arg)

Other commands I used to do that poo poo with regularly: less, grep, and their z* variants

You're right, of course. :doh: But I didn't know about zless, zgrep, etc. That's cool!
