necrotic
Aug 2, 2005
I owe my brother big time for this!

Empress Brosephine posted:

ooo that's a good idea using Python to do it. Thanks!

Maybe I should learn one day what language VS extensions are programmed in and just make one for myself...all it needs to do is scan the .html for "class=" or "Class=" or whatever typing of it and then dump what follows until the next line break!

Thanks!

They are written in JavaScript.
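
The scan itself, as described in the quote, fits in a few lines of Python, the language suggested earlier in the thread. This is only a rough sketch: the name `find_class_values` and the sample HTML are made up for illustration, and it assumes every attribute value is wrapped in single or double quotes.

```python
import re

# Matches class="..." or class='...' in any capitalisation.
# Assumption: attribute values are always quoted.
CLASS_ATTR = re.compile(r"""\bclass\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def find_class_values(html):
    """Return the raw value of every class attribute found in the text."""
    return [match.group(2) for match in CLASS_ATTR.finditer(html)]


sample = '<div class="box wide"><p Class=\'note\'>hi</p></div>'
print(find_class_values(sample))  # prints ['box wide', 'note']
```

It works on the raw text, so it will also pick up anything that merely looks like a class attribute, for example inside a comment.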


CarForumPoster
Jun 26, 2013

⚡POWER⚡

Empress Brosephine posted:

ooo that's a good idea using Python to do it. Thanks!

Maybe I should learn one day what language VS extensions are programmed in and just make one for myself...all it needs to do is scan the .html for "class=" or "Class=" or whatever typing of it and then dump what follows until the next line break!

Thanks!

Tangentially related but I've used regex101.com dozens of times so far for regex building stuff. Pretty useful site because of its ability to easily paste test cases

Macichne Leainig
Jul 26, 2012

by VG

CarForumPoster posted:

Tangentially related but I've used regex101.com dozens of times so far for regex building stuff. Pretty useful site because of its ability to easily paste test cases

+1 Regex101.com, I love that it can explain how the regular expression works as well. That was the piece I needed to be able to wrap my head around regex.

Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
I'm going to check that out, thanks

CarForumPoster
Jun 26, 2013

⚡POWER⚡
FWIW 50+ times over ~4 years using regexs to accomplish something and I still don’t understand them well enough to write one from scratch. I just always Google->stack overflow->regex101 if needed-> test in code.

It’s a crutch.

Slimy Hog
Apr 22, 2008

CarForumPoster posted:

FWIW 50+ times over ~4 years using regexs to accomplish something and I still don’t understand them well enough to write one from scratch. I just always Google->stack overflow->regex101 if needed-> test in code.

It’s a crutch.

I usually either write a simple one by hand then move to regex101 to test a bunch of edge cases and fix my inevitable mistakes or skip all that and jump to regex101 and use the sidebar to write a regex.

If anyone tells you that they can write flawless regex without help, they're lying to you.

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
No, I think a lot of people have learned regular expressions well enough to code them without help. That doesn't mean that they never make mistakes or that they know every detail of every extension in PCRE, but well enough to rattle off something basic like ^(?:\w+:\w+,)*\w+:\w+$, sure. It's just another programming language, one with at most 20 features you actually need to remember: \, ., \w, \d, \s, \S, [], [^], ^, $, (), ?:, |, *, *?, +, +?, maybe {}. I can never remember the lookahead or back-reference stuff, but some people really swear by it.

Like anything else, it's hard to retain and build expertise if you're not using it for months at a time, but if you ever start needing it day-to-day, you'll learn it quickly enough.
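
As a quick check of what the example pattern above accepts, here is a small Python sketch. The helper name `is_pair_list` and the test strings are invented for illustration; the pattern is the one quoted in the post.

```python
import re

# The pattern from the post: comma-separated word:word pairs, no trailing comma.
PAIRS = re.compile(r"^(?:\w+:\w+,)*\w+:\w+$")


def is_pair_list(text):
    """True if the text looks like key:value,key:value,..."""
    return PAIRS.match(text) is not None


for candidate in ["a:1", "a:1,b:2,c:3", "a:1,", "a:1,b"]:
    print(candidate, is_pair_list(candidate))
```

The first two strings are accepted and the last two are rejected, because the final `\w+:\w+` must always be a complete pair.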

rjmccall fucked around with this message at 06:33 on May 1, 2021

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.
I can do pretty good, though I also enjoy regex crosswords so maybe I'm special.

I could never remember what \b and \w do until I tried out the Execute Program regex course. Somehow that seared it into my brain. It teaches other parts of regexes too!

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
I can never remember the language-specific extensions so I always have to look them up every time. for example the syntax for named subpatterns is different between C# and Python and I have to look it up whenever I use it in either. I should really write myself a cheat sheet or something I guess.
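
For the cheat sheet: Python's `re` module spells a named group `(?P<name>...)`, while .NET (and so C#) spells it `(?<name>...)`. A minimal Python sketch, with a made-up date pattern as the example:

```python
import re

# Python: (?P<name>...)    .NET / C#: (?<name>...)
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = DATE.fullmatch("2021-05-01")
print(match.group("year"), match.group("month"), match.group("day"))
print(match.groupdict())
```

`groupdict()` returns all named groups at once, which is usually the reason for naming them in the first place.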

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
Also I don't know whether the point was lost but regular expressions aren't a suitable solution for the general problem of extracting class names from an HTML page. I suggested that as a solution because it was a specific page that the poster wanted to do that to, and that page is unlikely to present false positives or if it does you could just work around them. If you had to do it for pages in general then you should use an HTML parsing library.
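
For the general case, a sketch of the parsing-library route using `html.parser` from the Python standard library. The class name `ClassCollector`, the helper `collect_class_names`, and the sample markup are all hypothetical; a third-party parser would do the job just as well.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the class names used on every start tag."""

    def __init__(self):
        super().__init__()
        self.class_names = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attribute names arrive lowercased; value is None for bare attributes.
            if name == "class" and value:
                self.class_names.extend(value.split())


def collect_class_names(html):
    """Parse the markup and return class names in document order."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names


sample = '<div CLASS="box wide"><!-- class="fake" --><p class=note>hi</p></div>'
print(collect_class_names(sample))  # prints ['box', 'wide', 'note']
```

The commented-out `class="fake"` is skipped and the unquoted `class=note` is still found, which are exactly the cases a plain text scan gets wrong.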

Super-NintendoUser
Jan 16, 2004

COWABUNGERDER COMPADRES
Soiled Meat
The only time I have to usually deal with regex is with SSO, since the identity provider typically has usernames stored differently than the legacy systems I'm trying to integrate and I can use regex to manipulate the name (turn f.lastname into firstname.lastname). Doing this regex is an absolute nightmare, but it's possible. Typically I just use a different feature and map the names separately since I can't trust the customer's ldap anyways. Once in a while a guy comes along with a misspelled principal name, or he's lastname.f for some reason and the regex doesn't work.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
Step one of learning something is to stop trying to convince yourself that it's impossible to learn.

Step two of learning regular expressions is probably to get in the habit of using them for find/replace in your editor of choice as often as possible. It'll be slower than whatever you're currently doing at first, but you won't get better at writing regexps without just spending time doing it.

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

Jerk McJerkface posted:

The only time I have to usually deal with regex is with SSO, since the identity provider typically has usernames stored differently than the legacy systems I'm trying to integrate and I can use regex to manipulate the name (turn f.lastname into firstname.lastname). Doing this regex is an absolute nightmare, but it's possible. Typically I just use a different feature and map the names separately since I can't trust the customer's ldap anyways. Once in a while a guy comes along with a misspelled principal name, or he's lastname.f for some reason and the regex doesn't work.

oh man that doesn't sound like a good idea but you gotta do what you gotta do i guess

KillHour
Oct 28, 2007


I personally like https://regexr.com/

Add me to the list of people who use regex on a weekly basis and still need to look up basic poo poo.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


I like https://www.debuggex.com/ but I don't do anything particularly complicated.

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?

rjmccall posted:

No, I think a lot of people have learned regular expressions well enough to code them without help. That doesn't mean that they never make mistakes or that they know every detail of every extension in PCRE, but well enough to rattle off something basic like ^(?:\w+:\w+,)*\w+:\w+$, sure. It's just another programming language, one with at most 20 features you actually need to remember: \, ., \w, \d, \s, \S, [], [^], ^, $, (), ?:, |, *, *?, +, +?, maybe {}. I can never remember the lookahead or back-reference stuff, but some people really swear by it.

Like anything else, it's hard to retain and build expertise if you're not using it for months at a time, but if you ever start needing it day-to-day, you'll learn it quickly enough.

Regexs are like a programming language, although one with a syntax akin to brainfuck.

nielsm
Jun 1, 2009



Sure, regular expressions have a terse syntax, but the structure would be the same if you used a keyword-based syntax with whitespace between tokens and such, but it would also be so much more to write. Some regex syntaxes do allow whitespace to not be significant and even comments, but still use the common symbols for operators.
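
Python is one of the syntaxes that allows this, via the `re.VERBOSE` flag. A small sketch, reusing the key:value pattern quoted earlier in the thread purely as an example:

```python
import re

# Same operators as the compact form, but whitespace is ignored
# and everything after # is a comment.
PAIRS = re.compile(
    r"""
    ^
    (?: \w+ : \w+ , )*   # zero or more "key:value," pairs
    \w+ : \w+            # a final "key:value" pair with no trailing comma
    $
    """,
    re.VERBOSE,
)

print(bool(PAIRS.match("a:1,b:2")))  # prints True
print(bool(PAIRS.match("a:1,")))     # prints False
```

The symbols are unchanged, so this helps with layout and documentation rather than with the terseness of the operators themselves.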

Super-NintendoUser
Jan 16, 2004

COWABUNGERDER COMPADRES
Soiled Meat

Bruegels Fuckbooks posted:

oh man that doesn't sound like a good idea but you gotta do what you gotta do i guess

The issue here (which I agree is bad) is that I've been doing replatforming of a CMS that has an internal identity pool/structure/permissions already inside it.

The replatforms typically also include using SSO w/SAML. It's not a hard technical task, since our SAML module is a standard open source one.

The issue is that users used to access the client via their username inside the client. Typically firstname.lastname. However when they go to saml, usually it's the LDAP name or auth token from SAML that contains a different assertion ID, and I don't know what it is. Sometimes this is initial.lastname or it's from before they were married or divorced and the names don't match. Or in ldap they are just an ID number and their people name isn't used.

In our latest release we just added an alias field to the internal user identity config. So they do a saml auth, and then the cms checks the assertion Id against the user saml alias field, and then impersonates the name of the actual user so the legacy config/permissions still work.
We also provide an API that they can pipe names/identities etc into it. It works well now and there's no regex required.

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?

nielsm posted:

Sure, regular expressions have a terse syntax, but the structure would be the same if you used a keyword-based syntax with whitespace between tokens and such, but it would also be so much more to write. Some regex syntaxes do allow whitespace to not be significant and even comments, but still use the common symbols for operators.

My point is that with programming languages it's generally considered a good trade-off if the syntax is more readable but takes longer to write. Sometimes I think that attitude is taken too far, but certainly regexs have a reputation for being difficult to read, and I don't consider that surprising given the syntax.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


In a lot of cases the problem is that the logic you're trying to implement is inherently complicated. Consider the following regex for email address validation taken from the accepted answer to How to validate an email address using a regular expression?:
code:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
That's very hard to understand, but it's no shorter than it needs to be to cover the standard. Anything that implements all parts of that standard is going to be similarly complex. Maybe a different syntax would make the different parts more apparent, but the logic just isn't simple.

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
Generally speaking, you're better off using a selection of multiple regular expressions combined with some actual code in your programming language of choice.

Or a formal grammar and a parser generator.

Knowing when it's no longer appropriate to write "a regex" as your solution is part of being able to use regexes effectively.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



You validate an email address by sending an email with a link to click IMO, but that's not a regex issue.

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

Munkeymon posted:

You validate an email address by sending an email with a link to click IMO, but that's not a regex issue.

And this lets you turn your regex into something like \S+@\S+ which is much easier to remember!

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

pokeyman posted:

And this lets you turn your regex into something like \S+@\S+ which is much easier to remember!

you can technically have spaces in an email address if you want. something like "My Butt"@somethingawful.com is a valid email address (the quotes are part of the address)

I think .+@.+ or some variant like [^@]+(@[^@]*)*@[^@]+ would be ok but for all I know they're not, the rules for email addresses are extremely permissive.

tbh if you really need to check whether an email address is well-formed you probably want to use a finite state machine, not a regular expression (because of the quoting rules); and if you want to check whether it is valid you should just try sending email to it.

of course none of this stops you saying "gently caress your fancy-rear end email address that you only created for the sake of being technically correct, I'm arbitrarily disallowing it" and if you're in a position to do that then hell, you should.
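
To see how the three loose patterns from the last few posts differ, here is a small Python comparison. The dictionary keys, the helper `accepted_by`, and the sample addresses other than the quoted one are invented for illustration, and it assumes the whole string has to match (`re.fullmatch`) rather than just a substring.

```python
import re

# The three loose patterns mentioned in the thread.
PATTERNS = {
    "no_whitespace": r"\S+@\S+",
    "anything": r".+@.+",
    "last_at_splits": r"[^@]+(@[^@]*)*@[^@]+",
}

ADDRESSES = ["user@example.com", '"My Butt"@somethingawful.com', "no-at-sign"]


def accepted_by(address):
    """Names of the patterns that match the entire address."""
    return [
        name
        for name, pattern in PATTERNS.items()
        if re.fullmatch(pattern, address)
    ]


for address in ADDRESSES:
    print(address, accepted_by(address))
```

Only the whitespace-free pattern rejects the quoted address with a space in it; none of them says anything about whether mail to the address would actually arrive.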

Macichne Leainig
Jul 26, 2012

by VG

Hammerite posted:

you can technically have spaces in an email address if you want. something like "My Butt"@somethingawful.com is a valid email address (the quotes are part of the address)

Technically sure, but basically every email service out there prevents you from registering an email address with a space in it, so I'm not gonna support some weird rear end edge case.

I think it makes sense to do some basic validations to make sure the input data looks like an email address, but beyond that implementing a regex that validates an email address down to the exact email address specifications doesn't really provide a tangible benefit.

Anyway, you are obviously on the same page about dumb "technically correct for the sake of being technically correct" email addresses so I digress.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
my email address is @@@.at don't @ me

Spikes32
Jul 25, 2013

Happy trees
Are there any programs or websites out there that will let me automatically diagram out IIF statements (and ideally work in reverse too)? I work with a dumb program that gives me access to limited customization options and that's the most used one, but following IIF statements 5/6 levels deep with multiple branches is a real pain.

raminasi
Jan 25, 2005

a last drink with no ice

Hammerite posted:

tbh if you really need to check whether an email address is well-formed you probably want to use a finite state machine, not a regular expression (because of the quoting rules)

Regular expressions are a language for encoding finite state machines (weird backreference stuff aside)

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


raminasi posted:

Regular expressions are a language for encoding finite state machines (weird backreference stuff aside)

The two representations are equivalent in power but sometimes one representation is considerably more simple than another. Think about the set of all binary strings that contain an even number of 1s and an odd number of 0s. That's very easy to describe as an FSM but the regex for it isn't quite as simple.
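
A sketch of that machine in Python, to show how little there is to it: four states, each a pair of parities. The names `step` and `accepts` are made up for this example.

```python
# State is (parity of 0s seen, parity of 1s seen): four states in total.
START = ("even", "even")
ACCEPT = ("odd", "even")  # odd number of 0s, even number of 1s
FLIP = {"even": "odd", "odd": "even"}


def step(state, symbol):
    """Transition function: each symbol flips exactly one parity."""
    zeros, ones = state
    if symbol == "0":
        return (FLIP[zeros], ones)
    if symbol == "1":
        return (zeros, FLIP[ones])
    raise ValueError(f"not a binary digit: {symbol!r}")


def accepts(bits):
    """Run the machine over the string and check the final state."""
    state = START
    for symbol in bits:
        state = step(state, symbol)
    return state == ACCEPT


print(accepts("011"))  # prints True: one 0, two 1s
print(accepts("01"))   # prints False: one 0, one 1
```

Writing an equivalent regular expression by hand means encoding all the ways of moving between those four states, which is where the extra complexity comes from.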

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

raminasi posted:

Regular expressions are a language for encoding finite state machines (weird backreference stuff aside)

After looking on wikipedia, it turns out "finite state machine" has a more restrictive definition than I thought. What I thought qualified as a finite state machine is actually a "pushdown automaton". I thought that a finite-state machine could have finitely many variables (e.g. an integer telling you how many levels of nesting you have entered), apparently not? It seems weird to conceptualise that as being backed by a "stack" with symbols from the singleton alphabet but I guess that's what wiki is telling me.
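
A minimal sketch of the kind of machine described here, with one integer tracking nesting depth. The function name `is_balanced` is made up. Because the counter has no upper bound, the number of possible configurations is not finite, which is why the formal definition does not count this as a finite-state machine.

```python
def is_balanced(text):
    """Check that parentheses nest correctly, using a single depth counter."""
    depth = 0  # plays the role of a stack over a one-symbol alphabet
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing bracket with nothing open.
                return False
    return depth == 0


print(is_balanced("(()())"))  # prints True
print(is_balanced("(()"))     # prints False
```

Capping `depth` at some fixed maximum would turn this back into a true finite-state machine, at the cost of rejecting deeper nesting.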

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


Finite state machines in an introductory theory class are very restricted--they're not even pushdown automata because they don't have stacks--but finite state machines in general are not. There are models for those as well but they don't get taught early on.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

ultrafilter posted:

Finite state machines in an introductory theory class are very restricted--they're not even pushdown automata because they don't have stacks--but finite state machines in general are not. There are models for those as well but they don't get taught early on.

so are you saying that "finite state machine" has 2 different meanings? how come? Wikipedia's article on "finite-state machine" defines them as being strictly less capable than pushdown automata.

My degree is in maths and stats, all my cs knowledge I picked up as a hobbyist or later on the job, it doesn't come from introductory classes but it is limited in some areas.

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
Formally, a finite state machine is characterized by storing a statically-bounded amount of information. A pushdown automaton can store unbounded information because the stack can grow without limit.

Programmers often say “finite state machine” for the design pattern of a finite control: a system that receives discrete events and switches on some enumerated internal state to decide how to respond, and where a response can include changing that enumerated state. It’s not meant to imply that the total amount of information stored is finite, though; and formally, even Turing machines are built around a finite control.

nielsm
Jun 1, 2009



A finite state machine definitely can't have any variables, other than the current state. If you want one with variables that all have a finite set of valid values then sure, you can build a multidimensional matrix with each variable in one dimension, then enumerate all positions in the matrix into new state values (in a single dimension), and have a stupidly complex state machine. At least as a mathematical model.
If you want one with variables that have a (conceptually) infinite range of valid values, then your state machine is no longer finite.
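
The flattening of finite-valued variables into plain states can be sketched with `itertools.product`. The two variables here (`MODES` and `RETRIES`) are hypothetical, chosen only to keep the numbers small.

```python
from itertools import product

# Hypothetical variables, each with a finite set of values.
MODES = ["idle", "running"]
RETRIES = [0, 1, 2]

# Every combination of values becomes one state of the flattened machine.
STATES = list(product(MODES, RETRIES))
STATE_ID = {combo: index for index, combo in enumerate(STATES)}

print(len(STATES))               # prints 6, i.e. 2 * 3
print(STATE_ID[("running", 1)])  # prints 4
```

The state count is the product of the variable ranges, so it grows quickly, but it stays finite as long as every range does.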

Pushdown automata are a different class of computational capability. They have a stack of infinite capacity, so they do have an infinite number of possible states.

As far as I remember, Turing machines are the next step up in computational capability.

You can make the argument that digital computers are finite state machines, they just tend to have somewhere around 2^1000000000000 different states which (in isolation) makes them pretty good at pretending to be Turing machines, but on the other hand they actually also have I/O devices that let them output data to external systems of unknown capabilities, and that output could potentially affect the input stream, so I'd argue they are fully Turing complete given an appropriate I/O device. (Thematically appropriate would be a tape station with some kind of automatic cutting and splicing to make it look like there was a spool with infinite capacity in either direction.)

Edit: ^ above me is better at being concise

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


rjmccall posted:

Formally, a finite state machine is characterized by storing a statically-bounded amount of information. A pushdown automaton can store unbounded information because the stack can grow without limit.

Programmers often say “finite state machine” for the design pattern of a finite control: a system that receives discrete events and switches on some enumerated internal state to decide how to respond, and where a response can include changing that enumerated state. It’s not meant to imply that the total amount of information stored is finite, though; and formally, even Turing machines are built around a finite control.

This is the distinction I was getting at. There are also different types of finite state machines that people study to develop theories about various systems. For instance, timed automata are very important for reasoning about real-time systems.

Yaoi Gagarin
Feb 20, 2014

There are linear bounded automata, in between pushdown and Turing machines.

Each of these machine classes corresponds to a type of language in the Chomsky hierarchy:
FSM = regular language
Pushdown automata = context free language
LBA = context sensitive language
Turing machine = unrestricted language

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


Finite state machines and Turing machines accept the same class of languages whether you allow nondeterminism or not, nondeterministic pushdown automata are strictly more powerful than the deterministic ones, and it's still an open question for linear bounded automata (because no one really cares).

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

Hammerite posted:

so are you saying that "finite state machine" has 2 different meanings? how come? Wikipedia's article on "finite-state machine" defines them as being strictly less capable than pushdown automata.

My degree is in maths and stats, all my cs knowledge I picked up as a hobbyist or later on the job, it doesn't come from introductory classes but it is limited in some areas.

the class where all the cs majors learn this poo poo is called "theory of computation" or "automata theory."

it was the first course i ever took where the professor assigned his own textbook. the warning sign was the subtitle "a gentle introduction" - that was a lie, the class was not gentle.

Gothmog1065
May 14, 2009
Is there a string length limit that regex cannot handle? Namely in older systems (it is Monk code, based off of Lisp).


Jabor
Jul 16, 2010

#1 Loser at SpaceChem
If there is, it would be specific to that particular system, rather than a general regex thing.
