From chet.ramey at case.edu  Wed Mar  1 00:53:44 2023
From: chet.ramey at case.edu (Chet Ramey)
Date: Tue, 28 Feb 2023 09:53:44 -0500
Subject: [COFF] [TUHS] Re: Generational development [was Re: Re: Early
 GUI on Linux]
In-Reply-To: <CAEoi9W4sq7r_azuNbogDcfDTM2xy7WPnQ6-n+de-b-iCVhHqUg@mail.gmail.com>
References: <16241ceb-fe92-7f25-bda0-0b327847728d@case.edu>
 <B7F6403D-E276-490B-AB11-835141F31339@iitbombay.org>
 <vNaSB1ygm5HY-rV-WScmTmerF0acmZicvrUsW4kpDQ-n0-rpXSNQTh9V6mMHVLEbH6cjpXIQrHM8U4Oc4e6vzzA1sGF2eM9lxXqUbEn2bfc=@protonmail.com>
 <735c811e-62ce-5384-b83f-a3887baac89d@case.edu>
 <CAEoi9W6F__U=TVSkPgNbBHYyjhYPQjHPnMPoOvaz6QPF466w0Q@mail.gmail.com>
 <5a7aa991-7656-3faf-b34a-d613736716fd@case.edu>
 <CAEoi9W59EKkki1CUx18jVKBZ-_EJS3kmYDbcDRUXmUauiQ_H+g@mail.gmail.com>
 <f62b85a2-7212-a601-7cb3-d0cd5a38c0f3@case.edu>
 <CAEoi9W4sq7r_azuNbogDcfDTM2xy7WPnQ6-n+de-b-iCVhHqUg@mail.gmail.com>
Message-ID: <708986db-d22e-3b1b-7dad-c15025697e42@case.edu>

On 2/27/23 7:28 PM, Dan Cross wrote:

> Huh? Rustup is the context that this came up in:

I think if you look back in the thread, you'll find that the message from
segaloco was a reply to a message of mine where I criticized the practice
of piping from `wget' to `sh'. That's the context.


>> But just because you don't run `sudo sh' when using
>> `rustup' doesn't mean there aren't a disturbingly large number of
>> installers -- or whatever -- for which that is the recommended workflow.
>>
>> Nor does the fact that `rustup' is a safe example mean that this is a safe
>> practice in general. I posit that it's a bad idea in general to blindly
>> run scripts you download from the Internet, and it's especially bad to
>> do it as root. Depending on how you accept risk, you can choose to do
>> things about it, but that's often not part of recommendations.
> 
> I cannot help but point out that this is moving the goalposts somewhat
> from the specific context that I was responding to. If we're now
> talking about things in general then I agree with you.

We were talking about the general practice before Matt used `rustup' as a
specific example. I'm glad we agree it's a bad idea.


>> In any case, if you want
>> to, you can have a workflow where you rebuild configure yourself.
> 
> This is true, but then there's the autotools source stuff that you've
> got to inspect as well, and on and on.

Sure, there's always a limit to where trust takes over. It's ultimately
who you trust to do the packaging: is it your distro/OS vendor, your
package manager (e.g., macports, homebrew), free software distributors
(e.g., signed tar files from gnu.org), or the authors themselves?

> Or perhaps they just cargo-cult it and don't
> really think about it, which (I think) hews closer to the argument
> that folks here have been making.

That's pretty close to the point I was making originally.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet at case.edu    http://tiswww.cwru.edu/~chet/


From crossd at gmail.com  Wed Mar  1 01:25:04 2023
From: crossd at gmail.com (Dan Cross)
Date: Tue, 28 Feb 2023 10:25:04 -0500
Subject: [COFF] [TUHS] Re: Generational development [was Re: Re: Early
 GUI on Linux]
In-Reply-To: <708986db-d22e-3b1b-7dad-c15025697e42@case.edu>
References: <16241ceb-fe92-7f25-bda0-0b327847728d@case.edu>
 <B7F6403D-E276-490B-AB11-835141F31339@iitbombay.org>
 <vNaSB1ygm5HY-rV-WScmTmerF0acmZicvrUsW4kpDQ-n0-rpXSNQTh9V6mMHVLEbH6cjpXIQrHM8U4Oc4e6vzzA1sGF2eM9lxXqUbEn2bfc=@protonmail.com>
 <735c811e-62ce-5384-b83f-a3887baac89d@case.edu>
 <CAEoi9W6F__U=TVSkPgNbBHYyjhYPQjHPnMPoOvaz6QPF466w0Q@mail.gmail.com>
 <5a7aa991-7656-3faf-b34a-d613736716fd@case.edu>
 <CAEoi9W59EKkki1CUx18jVKBZ-_EJS3kmYDbcDRUXmUauiQ_H+g@mail.gmail.com>
 <f62b85a2-7212-a601-7cb3-d0cd5a38c0f3@case.edu>
 <CAEoi9W4sq7r_azuNbogDcfDTM2xy7WPnQ6-n+de-b-iCVhHqUg@mail.gmail.com>
 <708986db-d22e-3b1b-7dad-c15025697e42@case.edu>
Message-ID: <CAEoi9W5d0pgjEqkvkq4thOWB3oYnP1mSXfB7dNH+wHHw0s=E6Q@mail.gmail.com>

On Tue, Feb 28, 2023 at 9:53 AM Chet Ramey <chet.ramey at case.edu> wrote:
> On 2/27/23 7:28 PM, Dan Cross wrote:
> > Huh? Rustup is the context that this came up in:
>
> I think if you look back in the thread, you'll find that the message from
> segaloco was a reply to a message of mine where I criticized the practice
> of piping from `wget' to `sh'. That's the context.

Yes, it is quite clear we were speaking past one another.

        - Dan C.

From chet.ramey at case.edu  Wed Mar  1 02:03:47 2023
From: chet.ramey at case.edu (Chet Ramey)
Date: Tue, 28 Feb 2023 11:03:47 -0500
Subject: [COFF] [TUHS] Re: Generational development [was Re: Re: Early
 GUI on Linux]
In-Reply-To: <CAEoi9W5d0pgjEqkvkq4thOWB3oYnP1mSXfB7dNH+wHHw0s=E6Q@mail.gmail.com>
References: <16241ceb-fe92-7f25-bda0-0b327847728d@case.edu>
 <B7F6403D-E276-490B-AB11-835141F31339@iitbombay.org>
 <vNaSB1ygm5HY-rV-WScmTmerF0acmZicvrUsW4kpDQ-n0-rpXSNQTh9V6mMHVLEbH6cjpXIQrHM8U4Oc4e6vzzA1sGF2eM9lxXqUbEn2bfc=@protonmail.com>
 <735c811e-62ce-5384-b83f-a3887baac89d@case.edu>
 <CAEoi9W6F__U=TVSkPgNbBHYyjhYPQjHPnMPoOvaz6QPF466w0Q@mail.gmail.com>
 <5a7aa991-7656-3faf-b34a-d613736716fd@case.edu>
 <CAEoi9W59EKkki1CUx18jVKBZ-_EJS3kmYDbcDRUXmUauiQ_H+g@mail.gmail.com>
 <f62b85a2-7212-a601-7cb3-d0cd5a38c0f3@case.edu>
 <CAEoi9W4sq7r_azuNbogDcfDTM2xy7WPnQ6-n+de-b-iCVhHqUg@mail.gmail.com>
 <708986db-d22e-3b1b-7dad-c15025697e42@case.edu>
 <CAEoi9W5d0pgjEqkvkq4thOWB3oYnP1mSXfB7dNH+wHHw0s=E6Q@mail.gmail.com>
Message-ID: <ba729d21-cf3c-92da-ee7e-6100a7d3b752@case.edu>

On 2/28/23 10:25 AM, Dan Cross wrote:
> On Tue, Feb 28, 2023 at 9:53 AM Chet Ramey <chet.ramey at case.edu> wrote:
>> On 2/27/23 7:28 PM, Dan Cross wrote:
>>> Huh? Rustup is the context that this came up in:
>>
>> I think if you look back in the thread, you'll find that the message from
>> segaloco was a reply to a message of mine where I criticized the practice
>> of piping from `wget' to `sh'. That's the context.
> 
> Yes, it is quite clear we were speaking past one another.

OK, let's not do that any more. :-)

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet at case.edu    http://tiswww.cwru.edu/~chet/


From lars at nocrew.org  Thu Mar  2 16:41:31 2023
From: lars at nocrew.org (Lars Brinkhoff)
Date: Thu, 02 Mar 2023 06:41:31 +0000
Subject: [COFF] [TUHS] Re: Unix v7 icheck dup problem
In-Reply-To: <CAD2gp_R8CjSYGwtniQyHkfvR9aSfoozJ3qisqbwHDCocknubhg@mail.gmail.com>
 (John Cowan's message of "Wed, 1 Mar 2023 20:56:12 -0500")
References: <20230302013628.8E40618C07B@mercury.lcs.mit.edu>
 <CAD2gp_R8CjSYGwtniQyHkfvR9aSfoozJ3qisqbwHDCocknubhg@mail.gmail.com>
Message-ID: <7wsfenslic.fsf@junk.nocrew.org>

John Cowan <cowan at ccil.org> writes:
>>  which Rob Austein re-wrote into "Alice's PDP-10". 
> I didn't know that one was done at MIT.

This spells out the details:
https://www.hactrn.net/sra/alice/alice.glossary

From coff at tuhs.org  Fri Mar  3 04:54:49 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Thu, 2 Mar 2023 11:54:49 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
Message-ID: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>

Hi,

I'd like some thoughts ~> input on extended regular expressions used 
with grep, specifically GNU grep -E / egrep.

What are the pros / cons to creating extended regular expressions like 
the following:

    ^\w{3}

vs:

    ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)

Or:

    [ :[:digit:]]{11}

vs:

    ( 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]

I'm currently eliding the 61st (60) second, the 32nd day, and dealing 
with February having fewer days for simplicity.

For matching patterns like the following in log files?

    Mar  2 03:23:38

I'm working on organically training logcheck to match known good log 
entries.  So I'm *DEEP* in the bowels of extended regular expressions 
(GNU egrep) that runs over all logs hourly.  As such, I'm interested in 
making sure that my REs are both efficient and accurate or at least not 
WILDLY badly structured.  The pedantic part of me wants to avoid 
wildcard type matches (\w), even if they are bounded (\w{3}), unless it 
truly is for unpredictable text.

I'd appreciate any feedback and recommendations from people who have 
been using and / or optimizing (extended) regular expressions for longer 
than I have been using them.

Thank you for your time and input.



-- 
Grant. . . .
unix || die


From clemc at ccc.com  Fri Mar  3 05:23:25 2023
From: clemc at ccc.com (Clem Cole)
Date: Thu, 2 Mar 2023 14:23:25 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
Message-ID: <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>

Grant - check out Russ Cox's web page on this very subject: Implementing
Regular Expressions
<https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.com%2F%7Ersc%2Fregexp%2F>

On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff at tuhs.org> wrote:

> Hi,
>
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -E / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
>     ^\w{3}
>
> vs:
>
>     ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
>
> Or:
>
>     [ :[:digit:]]{11}
>
> vs:
>
>     ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
>
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
>
> For matching patterns like the following in log files?
>
>     Mar  2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries.  So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly.  As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured.  The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
>
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
>
>
>
> --
> Grant. . . .
> unix || die
>
>

From coff at tuhs.org  Fri Mar  3 05:38:19 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Thu, 2 Mar 2023 12:38:19 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
Message-ID: <e0e561eb-48dd-10a2-491d-15a801a3a0e7@spamtrap.tnetconsulting.net>

On 3/2/23 12:23 PM, Clem Cole wrote:
> Grant - check out Russ Cox's web page on this very subject: Implementing 
> Regular Expressions
Thank you for the pointer Clem.

It's at the top of my reading list.  I'll dig into the articles listed 
thereon later today.



-- 
Grant. . . .
unix || die


From crossd at gmail.com  Fri Mar  3 07:53:31 2023
From: crossd at gmail.com (Dan Cross)
Date: Thu, 2 Mar 2023 16:53:31 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
Message-ID: <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>

On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff at tuhs.org> wrote:
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -E / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
>     ^\w{3}
>
> vs:
>
>     ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)

Well, obviously the former matches any sequence of 3
alphanumerics/underscores at the beginning of a string, while the
latter only matches abbreviations of months in the western calendar;
that is, the two REs match very different things (the strings the
latter matches are a strict subset of those the former matches). But I
suspect you mean in a more general sense.

> Or:
>
>     [ :[:digit:]]{11}

...do you really want to match a space, a colon and a single digit 11
times in a single string?

> vs:
>
>     ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]

Using character classes would greatly simplify what you're trying to
do. It seems like this could be simplified to the following (untested)
snippet:

    ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]

For this, I'd probably eschew `[:digit:]`. Named character classes are
for handy locale support, or in lieu of typing every character in the
alphabet (though we can use ranges to abbreviate that), but it kind of
seems like that's not coming into play here and, IMHO, `[0-9]` is
clearer in context.
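
A minimal sketch of how the simplified pattern above could be checked
against sample timestamps. It assumes Python's `re` module as a stand-in
for egrep (the syntax used is common to both), writes the 10-29 day
alternative as `[12][0-9]`, and introduces two hypothetical names,
`DAY_TIME` and `DAY_TIME_STRICT`. The strict variant is an illustration
only: it shows one way to also reject hours 24 through 29, which
`[0-2][0-9]` lets through.

```python
import re

# Day-of-month and time portion of a syslog-style timestamp, e.g. " 2 03:23:38".
# Assumption: Python's re module stands in for egrep; this syntax works in both.
DAY_TIME = re.compile(r"( [1-9]|[12][0-9]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]")

# Hypothetical tighter variant that also rejects hours 24 through 29.
DAY_TIME_STRICT = re.compile(
    r"( [1-9]|[12][0-9]|3[01]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]"
)


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


samples = [" 2 03:23:38", "31 23:59:59", "32 00:00:00", " 2 29:00:00", " 2 03:60:00"]
for s in samples:
    print(repr(s), matches(DAY_TIME, s), matches(DAY_TIME_STRICT, s))
```

Whether the extra strictness is worth the longer pattern depends on how
much junk the log source is likely to produce.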

> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.

It's not clear to me that dates, in their generality, can be matched
with regular expressions.  Consider leap years; you'd almost
necessarily have to use backtracking for that, but I admit I haven't
thought it through.

> For matching patterns like the following in log files?
>
>     Mar  2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries.  So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly.  As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured.  The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.

`\w` is a GNU extension; I'd probably avoid it on portability grounds
(though `\b` is very handy).

The thing about regular expressions is that they describe regular
languages, and regular languages are those for which there exists a
finite automaton that can recognize the language. An important class
of finite automata are deterministic finite automata; by definition,
recognition by such automata is linear in the length of the input.

However, construction of a DFA for any given regular expression can be
superlinear (in fact, it can be exponential) so practically speaking,
we usually construct non-deterministic finite automata (NDFAs) and
"simulate" their execution for matching. NDFAs generalize DFAs (DFAs
are a subset of NDFAs, incidentally) in that, in any non-terminal
state, there can be multiple subsequent states that the machine can
transition to given an input symbol. When executed, for any state, the
simulator will transition to every permissible subsequent state
simultaneously, discarding impossible states as they become evident.

This implies that NDFA execution is superlinear, but it is bounded,
and is O(n*m*e), where n is the length of the input, m is the number
of nodes in the state transition graph corresponding to the NDFA, and
e is the maximum number of edges leaving any node in that graph (for a
fully connected graph, that would be m, so this can be up to O(n*m^2)).
Construction of an NDFA is O(m), so while it's slower to execute, it's
actually possible to construct in a reasonable amount of time. Russ's
excellent series of articles that Clem linked to gives details and
algorithms.
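
The simulation described above can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not how egrep is
implemented: the automaton is built by hand for the textbook expression
`(a|b)*abb`, it has no epsilon transitions, and the names `TRANSITIONS`,
`START`, `ACCEPTING` and `nfa_accepts` are made up for the example. The
point is the inner loop: for each input symbol it visits every live state
and follows every outgoing edge for that symbol, which is where the
n*m*e bound comes from.

```python
# Hand-built NFA for the regular expression (a|b)*abb.
# TRANSITIONS maps (state, symbol) to the set of possible next states.
TRANSITIONS = {
    (0, "a"): {0, 1},  # the non-deterministic choice: stay, or start "abb"
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(text):
    """Simulate the NFA by tracking every state it could be in at once."""
    current = {START}
    for symbol in text:
        following = set()
        for state in current:
            # A missing entry means "no transition": that path is discarded.
            following |= TRANSITIONS.get((state, symbol), set())
        current = following
        if not current:
            return False  # every path has died; later input cannot help
    return bool(current & ACCEPTING)


for s in ["abb", "aabb", "babb", "ab", "abba"]:
    print(s, nfa_accepts(s))
```

Because the set of live states never holds more than m entries, the work
per input symbol stays bounded no matter how the input is constructed.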

> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.

In practical terms? Basically, don't worry about it too much. Egrep
will generate an NDFA simulation that's going to be acceptably fast
for all but the weirdest cases.

        - Dan C.

From stuff at riddermarkfarm.ca  Fri Mar  3 09:01:40 2023
From: stuff at riddermarkfarm.ca (Stuff Received)
Date: Thu, 2 Mar 2023 18:01:40 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
Message-ID: <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>

On 2023-03-02 14:23, Clem Cole wrote:
> Grant - check out Russ Cox's web page on this very subject: Implementing 
> Regular Expressions 
> <https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.com%2F%7Ersc%2Fregexp%2F>

Clem, why are you linking through streaklinks.com?

N.

From steffen at sdaoden.eu  Fri Mar  3 09:46:12 2023
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Fri, 03 Mar 2023 00:46:12 +0100
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
 <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>
Message-ID: <20230302234612.qQ4rn%steffen@sdaoden.eu>

Stuff Received wrote in
 <c4328f0f-f304-e131-a9f3-32ce01fa6814 at riddermarkfarm.ca>:
 |On 2023-03-02 14:23, Clem Cole wrote:
 |> Grant - check out Russ Cox's web page on this very subject: Implementing 
 |> Regular Expressions 
 |> <https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.co\
 |> m%2F%7Ersc%2Fregexp%2F>
 |
 |Clem, why are you linking through streaklinks.com?

I do not want to be unfriendly; (but) i use firefox-bin
(mozilla-compiled) and my only extension is uMatrix that i have
been pointed to, and i can only recommend it highly to anyone
(though the "modern" web mostly requires to turn off tracking
protection and numerous white flags in uMatrix to work), maybe
even to those who simply put their browser into a container.
Anyhow, uMatrix gives you the following, and while i have not
tried to selectively click me through to get to the target,
i could have done so:

  streak.com
  www.streak.com
  cloudflare.com
  cdnjs.cloudflare.com
  d3e54v103j8qbb.cloudfront.net
  facebook.net
  connect.facebook.net
  google.com
  www.google.com
  ajax.googleapis.com
  storage.googleapis.com
  intercom.io
  widget.intercom.io
  licdn.com
  snap.licdn.com
  pdst.fm
  cdn.pdst.fm
  producthunt.com
  api.producthunt.com
  sentry-cdn.com
  js.sentry-cdn.com
  website-files.com
  assets.website-files.com
  assets-global.website-files.com
  google-analytics.com
  www.google-analytics.com
  googletagmanager.com
  www.googletagmanager.com

Randomized links i find just terrible.  IETF started using
randomized archive links, which are mesmerising; most often
mailman archive links give you a bit of orientation by themselves,
isn't that more appealing.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From coff at tuhs.org  Fri Mar  3 11:05:51 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Thu, 2 Mar 2023 18:05:51 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>
Message-ID: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>

On 3/2/23 2:53 PM, Dan Cross wrote:
> Well, obviously the former matches any sequence 3 of 
> alpha-numerics/underscores at the beginning of a string, while the 
> latter only matches abbreviations of months in the western calendar; 
> that is, the two REs are matching very different things (the latter 
> is a strict subset of the former).

I completely agree with you.  That's also why I'm wanting to start 
utilizing the latter, more specific RE.  But I don't know where the line 
of over complicating things is to avoid crossing it.

> But I suspect you mean in a more general sense.

Yes and no.  Does the comment above clarify at all?

> ...do you really want to match a space, a colon and a single digit 
> 11 times ...

Yes.

> ... in a single string?

What constitutes a single string?  ;-)  I sort of rhetorically ask.

The log lines start with

MMM dd hh:mm:ss

Where:
  - MMM is the month abbreviation
  - dd  is the day of the month
  - hh  is the hour of the day
  - mm  is the minute of the hour
  - ss  is the second of the minute

So, yes, there are eleven characters that fall into the class consisting 
of a space or a colon or a number.

Is that a single string?  It depends what you're looking at.  The 
sequences of non white space in the log?  No.  The pattern that I'm 
matching?  Yes.

> Using character classes would greatly simplify what you're trying to 
> do. It seems like this could be simplified to (untested) snippet:

Agreed.

I'm starting with the examples that came with the logcheck package that 
I'm working with, such as "^\w{3} [ :[:digit:]]{11}", and evaluating 
what I want to do.

I actually like the idea of dividing out the following:

  - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
  - months that have 30 days: Apr, Jun, Sep, Nov
  - month that has 28/29 days: Feb
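
One way the three groups above could be combined into a single pattern,
sketched in Python. Everything here is an assumption made for
illustration: the variable names are hypothetical, Python's `re` module
stands in for egrep, and February is simply allowed 29 days every year
with no leap-year check. The pattern string that results uses only
alternation, groups and bracket expressions, so it should carry over to
an extended regular expression unchanged.

```python
import re

# Hypothetical grouping of month abbreviations by their maximum day number.
MONTHS_31 = "Jan|Mar|May|Jul|Aug|Oct|Dec"
MONTHS_30 = "Apr|Jun|Sep|Nov"
MONTHS_29 = "Feb"  # 29 is allowed every year; leap years are not checked

# Day alternatives; single-digit days are space padded, as in "Mar  2".
DAYS_TO_29 = r" [1-9]|[12][0-9]"
DAYS_TO_30 = DAYS_TO_29 + r"|30"
DAYS_TO_31 = DAYS_TO_29 + r"|3[01]"

# Month, space, day, space, anchored at the start of the line.
MONTH_DAY = re.compile(
    rf"^(({MONTHS_31}) ({DAYS_TO_31})"
    rf"|({MONTHS_30}) ({DAYS_TO_30})"
    rf"|({MONTHS_29}) ({DAYS_TO_29})) "
)

for line in ["Mar  2 03:23:38", "Apr 30 12:00:00", "Apr 31 12:00:00", "Feb 30 00:00:00"]:
    print(repr(line), MONTH_DAY.search(line) is not None)
```

The cost is a pattern that is roughly three times as long as a single
month list followed by a single day list.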

> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]

Aside:  Why do you have the double square brackets in "[12][[0-9]]"?

> For this, I'd probably eschew `[:digit:]`. Named character classes 
> are for handy locale support, or in lieu of typing every character 
> in the alphabet (though we can use ranges to abbreviate that), but 
> it kind of seems like that's not coming into play here and, IMHO, 
> `[0-9]` is clearer in context.

ACK

"[[:digit:]]+" was a construct that I'm parroting.  It and 
[.:[:xdigit:]]+ are good for some things.  But they definitely aren't 
the best for all things.

Hence trying to find the line of being more accurate without going too far.

> It's not clear to me that dates, in their generality, can be 
> matched with regular expressions.  Consider leap years; you'd almost 
> necessarily have to use backtracking for that, but I admit I haven't 
> thought it through.

Given the context that these extended regular expressions are going to 
be used in, logcheck -- filtering out known okay log entries to email 
what doesn't get filtered -- I'm okay with having a few things slip 
through like leap day / leap seconds / leap frogs.

> `\w` is a GNU extension; I'd probably avoid it on portability grounds 
> (though `\b` is very handy).

I hear, understand, and acknowledge your concern.  At present, these 
filters are being used in a package; logcheck, which I believe is 
specific to Debian and ilk.  As such, GNU grep is very much a thing.

I'm also not a fan of the use of `\w` and would prefer to (...|...) things.

> The thing about regular expressions is that they describe regular 
> languages, and regular languages are those for which there exists a 
> finite automaton that can recognize the language. An important class 
> of finite automata are deterministic finite automata; by definition, 
> recognition by such automata are linear in the length of the input.
> 
> However, construction of a DFA for any given regular expression can be 
> superlinear (in fact, it can be exponential) so practically speaking, 
> we usually construct non-deterministic finite automata (NDFAs) and 
> "simulate" their execution for matching. NDFAs generalize DFAs (DFAs 
> are a subset of NDFAs, incidentally) in that, in any non-terminal 
> state, there can be multiple subsequent states that the machine can 
> transition to given an input symbol. When executed, for any state, 
> the simulator will transition to every permissible subsequent state 
> simultaneously, discarding impossible states as they become evident.
> 
> This implies that NDFA execution is superlinear, but it is bounded, 
> and is O(n*m*e), where n is the length of the input, m is the number 
> of nodes in the state transition graph corresponding to the NDFA, and 
> e is the maximum number of edges leaving any node in that graph (for 
> a fully connected graph, that would m, so this can be up to O(n*m^2)). 
> Construction of an NDFA is O(m), so while it's slower to execute, it's 
> actually possible to construct in a reasonable amount of time. Russ's 
> excellent series of articles that Clem linked to gives details and 
> algorithms.

I only vaguely understand those three paragraphs as they are deeper 
computer science than I've gone before.

I think I get the gist of them but could not explain them if my life 
depended upon it.

> In practical terms? Basically, don't worry about it too much. Egrep 
> will generate an NDFA simulation that's going to be acceptably fast 
> for all but the weirdest cases.

ACK

It sounds like I can make any reasonable extended regular expression a 
human can read and I'll probably be good.

Thank you for the detailed response Dan.  :-)



-- 
Grant. . . .
unix || die


From coff at tuhs.org  Fri Mar  3 11:08:45 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Thu, 2 Mar 2023 18:08:45 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
 <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>
Message-ID: <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net>

On 3/2/23 4:01 PM, Stuff Received wrote:
> Clem, why are you linking through streaklinks.com?

Here's a direct link to the page that I landed on when following Clem's 
link:

Link - Implementing Regular Expressions
  - https://swtch.com/~rsc/regexp/

I didn't pay attention to Clem's link beyond the fact that I got to the 
desired page without needing to tilt at my various filtering plugins.

Though the message I'm replying to has caused a few brain cells to find 
themselves in confusion ~> curiosity.



-- 
Grant. . . .
unix || die


From dave at horsfall.org  Fri Mar  3 12:10:31 2023
From: dave at horsfall.org (Dave Horsfall)
Date: Fri, 3 Mar 2023 13:10:31 +1100 (EST)
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
 <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>
 <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net>
Message-ID: <alpine.BSF.2.21.9999.2303031309000.4881@aneurin.horsfall.org>

On Thu, 2 Mar 2023, Grant Taylor via COFF wrote:

> Though the message I'm replying to has caused a few brain cells to find 
> themselves in confusion ~> curiosity.

Because evil things can happen with URL redirectors; personally I like to 
know where I'm going beforehand...

-- Dave

From crossd at gmail.com  Fri Mar  3 13:04:32 2023
From: crossd at gmail.com (Dan Cross)
Date: Thu, 2 Mar 2023 22:04:32 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>
 <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
Message-ID: <CAEoi9W4BjrcyEQdqUigfd+Oa3WYh-H_B4kh84XOoqRKrUmMm2A@mail.gmail.com>

On Thu, Mar 2, 2023 at 8:06 PM Grant Taylor via COFF <coff at tuhs.org> wrote:
> On 3/2/23 2:53 PM, Dan Cross wrote:
> > Well, obviously the former matches any sequence 3 of
> > alpha-numerics/underscores at the beginning of a string, while the
> > latter only matches abbreviations of months in the western calendar;
> > that is, the two REs are matching very different things (the latter
> > is a strict subset of the former).
>
> I completely agree with you.  That's also why I'm wanting to start
> utilizing the latter, more specific RE.  But I don't know where the line
> of over complicating things is to avoid crossing it.

I guess what I'm saying is, match what you want to match and don't sweat
the small stuff.

> > But I suspect you mean in a more general sense.
>
> Yes and no.  Does the comment above clarify at all?

Not exactly. :-)

What I understand you to mean, based on this and the rest of your note, is
that you want to find a good division point between overly specific,
complex REs and simpler, easy to understand REs that are less specific. The
danger with the latter is that they may match things you don't intend,
while the former are harder to maintain and (arguably) more brittle. I can
sympathize.

> > ...do you really want to match a space, a colon and a single digit
> > 11 times ...
>
> Yes.
>
> > ... in a single string?
>
> What constitutes a single string?  ;-)  I sort of rhetorically ask.

For the purposes of grep/egrep, that'll be a logical "line" of text,
terminated by a newline, though the newline itself isn't considered part of
the text for matching. I believe the `-z` option can be used to set a NUL
byte as the "line" terminator; presumably this lets one match strings with
embedded newlines, though I haven't tried.

> The log lines start with
>
> MMM dd hh:mm:ss
>
> Where:
>   - MMM is the month abbreviation
>   - dd  is the day of the month
>   - hh  is the hour of the day
>   - mm  is the minute of the hour
>   - ss  is the second of the minute
>
> So, yes, there are eleven characters that fall into the class consisting
> of a space or a colon or a number.
>
> Is that a single string?  It depends what you're looking at.  The
> sequences of non white space in the log?  No.  The pattern that I'm
> matching?  Yes.

"string" in this context is the input you're attempting to match against.
`egrep` will attempt to match your pattern against each "line" of text it
reads from the files it's searching. That is, each line in your log file(s).

But consider what `[ :[:digit:]]{11}` means: you've got a character class
consisting of space, colon and a digit; {11} means "match any of the
characters in that class exactly 11 times" (as opposed to other variations
on the '{}' syntax that say "at least m times", "at most n times", or
"between n and m times"). But that'll match all sorts of things that don't
look like 'dd hh:mm:ss':

term% egrep '[ :[:digit:]]{11}'
11111111111
11111111111
111111111
1111111111111
1111111111111
::::::::::::::::
::::::::::::::::
aaaa                      bbbbb
aaaa                      bbbbb


term%

(The first line is my typing; the second is output from egrep except for
the short line of 9 '1's, for which egrep had no output. The last two
lines are matching space characters and egrep echoing the match, but I'm
guessing gmail will eat those.)

Note that there are inputs with more than 11 characters that match; this is
because there is some 11-character substring that matches the RE in those
lines. In any event, I suspect this would generally not be what you want.
But if nothing else in your input can match the RE (which you might know a
priori because of domain knowledge about whatever is generating those logs)
then it's no big deal, even if the RE was capable of matching more things
generally.
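
To make the difference concrete, here is a minimal sketch comparing the
loose class with a tighter RE. The sample lines and the variable names
(`sample`, `loose`, `strict`) are made up for illustration, and the tighter
RE is only one plausible way to write it, not necessarily what any
particular package ships:

```shell
# Hypothetical sample input: one syslog-style line and three junk lines.
sample='Mar  2 03:23:38 host postfix[123]: ok
11111111111
::::::::::::::::
111111111'

# The loose RE: any 11 consecutive characters from {space, colon, digit}.
loose='[ :[:digit:]]{11}'

# A tighter, anchored RE: month name, space-padded day, then hh:mm:ss.
strict='^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ( [1-9]|[12][0-9]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] '

# Count how many sample lines each RE selects.
loose_count=$(printf '%s\n' "$sample" | grep -Ec "$loose")
strict_count=$(printf '%s\n' "$sample" | grep -Ec "$strict")

echo "loose: $loose_count  strict: $strict_count"
```

On this sample the loose RE selects three lines (the timestamp line plus
the runs of '1's and ':'s), while the anchored RE selects only the
timestamp line.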

> > Using character classes would greatly simplify what you're trying to
> > do. It seems like this could be simplified to (untested) snippet:
>
> Agreed.
>
> I'm starting with the examples that came with the logcheck package that
> I'm working with, such as "^\w{3} [ :[:digit:]]{11}", and evaluating
> what I want to do.

Ah. I suspect this relies on domain knowledge about the format of log lines
to match reliably. Otherwise it could match, `___ 123 456:789` which is
probably not what you are expecting.

> I actually like the idea of dividing out the following:
>
>   - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
>   - months that have 30 days: Apr, Jun, Sep, Nov
>   - month that have 28/29 days: Feb

Sure.  One nice thing about `egrep` et al is that you can put the REs into
a file and include them with `-f`, as opposed to having them all directly
on the command line.
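
As a rough sketch of the `-f` approach applied to the 31/30/28-29 day split
above: the patterns, the sample lines and the temporary file are all
illustrative assumptions, not taken from logcheck's own rule files.

```shell
# Write three hypothetical EREs to a temporary file, one per line.
patterns=$(mktemp)
cat > "$patterns" <<'EOF'
^(Jan|Mar|May|Jul|Aug|Oct|Dec) ( [1-9]|[12][0-9]|3[01]) [0-2][0-9]:
^(Apr|Jun|Sep|Nov) ( [1-9]|[12][0-9]|30) [0-2][0-9]:
^Feb ( [1-9]|[12][0-9]) [0-2][0-9]:
EOF

# A line is selected if it matches any one of the patterns in the file.
matched=$(printf '%s\n' \
    'Feb 29 12:00:00 a' \
    'Feb 30 12:00:00 b' \
    'Apr 31 12:00:00 c' \
    'Dec 31 23:59:59 d' | grep -E -f "$patterns")

rm -f "$patterns"
printf '%s\n' "$matched"
```

Only the `Feb 29` and `Dec 31` lines are selected; `Feb 30` and `Apr 31`
fall through because no pattern in the file admits them.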

> > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
>
> Aside:  Why do you have the double square brackets in "[12][[0-9]]"?

Typo.  :-)

> > For this, I'd probably eschew `[:digit:]`. Named character classes
> > are for handy locale support, or in lieu of typing every character
> > in the alphabet (though we can use ranges to abbreviate that), but
> > it kind of seems like that's not coming into play here and, IMHO,
> > `[0-9]` is clearer in context.
>
> ACK
>
> "[[:digit:]]+" was a construct that I'm parroting.  It and
> [.:[:xdigit:]]+ are good for some things.  But they definitely aren't
> the best for all things.
>
> Hence trying to find the line of being more accurate without going too
> far.
>
> > It's not clear to me that dates, in their generality, can be
> > matched with regular expressions.  Consider leap years; you'd almost
> > necessarily have to use backtracking for that, but I admit I haven't
> > thought it through.
>
> Given the context that these extended regular expressions are going to
> be used in, logcheck -- filtering out known okay log entries to email
> what doesn't get filtered -- I'm okay with having a few things slip
> through like leap day / leap seconds / leap frogs.

That seems reasonable.

> > `\w` is a GNU extension; I'd probably avoid it on portability grounds
> > (though `\b` is very handy).
>
> I hear, understand, and acknowledge your concern.  At present, these
> filters are being used in a package; logcheck,

Aside: I found the note on its website amusing: Brought to you by the UK's
best gambling sites! "Only gamble with what you can afford to lose." Yikes!

> which I believe is
> specific to Debian and ilk.  As such, GNU grep is very much a thing.

I'd proceed with caution here; it also seems to be in the FreeBSD and
DragonFly ports collections and Homebrew on the Mac (but so is GNU grep for
all of those).

> I'm also not a fan of the use of `\w` and would prefer to (...|...)
> things.

Yeah. IMHO `\w` is too general for what you're trying to do.

> > The thing about regular expressions is that they describe regular
> > languages, and regular languages are those for which there exists a
> > finite automaton that can recognize the language. An important class
> > of finite automata are deterministic finite automata; by definition,
> > recognition by such automata is linear in the length of the input.
> >
> > However, construction of a DFA for any given regular expression can be
> > superlinear (in fact, it can be exponential) so practically speaking,
> > we usually construct non-deterministic finite automata (NDFAs) and
> > "simulate" their execution for matching. NDFAs generalize DFAs (DFAs
> > are a subset of NDFAs, incidentally) in that, in any non-terminal
> > state, there can be multiple subsequent states that the machine can
> > transition to given an input symbol. When executed, for any state,
> > the simulator will transition to every permissible subsequent state
> > simultaneously, discarding impossible states as they become evident.
> >
> > This implies that NDFA execution is superlinear, but it is bounded,
> > and is O(n*m*e), where n is the length of the input, m is the number
> > of nodes in the state transition graph corresponding to the NDFA, and
> > e is the maximum number of edges leaving any node in that graph (for
> > a fully connected graph, that would be m, so this can be up to O(n*m^2)).
> > Construction of an NDFA is O(m), so while it's slower to execute, it's
> > actually possible to construct in a reasonable amount of time. Russ's
> > excellent series of articles that Clem linked to gives details and
> > algorithms.
>
> I only vaguely understand those three paragraphs as they are deeper
> computer science than I've gone before.
>
> I think I get the gist of them but could not explain them if my life
> depended upon it.

Basically, a regular expression is a regular expression if you can build a
machine with no additional memory that can tell you whether or not a given
string matches the RE examining its input one character at a time.

> > In practical terms? Basically, don't worry about it too much. Egrep
> > will generate an NDFA simulation that's going to be acceptably fast
> > for all but the weirdest cases.
>
> ACK
>
> It sounds like I can make any reasonable extended regular expression a
> human can read and I'll probably be good.

I think that's about right.

> Thank you for the detailed response Dan.  :-)

Sure thing!

        - Dan C.

From coff at tuhs.org  Fri Mar  3 13:34:19 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Thu, 2 Mar 2023 20:34:19 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <alpine.BSF.2.21.9999.2303031309000.4881@aneurin.horsfall.org>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAC20D2MS5Pp7tJ7Q10cUn-nBhNEzbaZ8_Hn-5J0wAfzd3s3M9A@mail.gmail.com>
 <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>
 <46d95806-8ea6-0eb4-41c9-1ae33e004faa@spamtrap.tnetconsulting.net>
 <alpine.BSF.2.21.9999.2303031309000.4881@aneurin.horsfall.org>
Message-ID: <57a22cdd-2523-d8fd-4004-360da77d4ba0@spamtrap.tnetconsulting.net>

On 3/2/23 7:10 PM, Dave Horsfall wrote:
> Because evil things can happen with URL redirectors; personally I like to
> know where I'm going beforehand...

I absolutely agree.

The confusion was why someone would purposefully choose to use a redirector.



-- 
Grant. . . .
unix || die


From coff at tuhs.org  Fri Mar  3 13:53:08 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Thu, 2 Mar 2023 20:53:08 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W4BjrcyEQdqUigfd+Oa3WYh-H_B4kh84XOoqRKrUmMm2A@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>
 <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
 <CAEoi9W4BjrcyEQdqUigfd+Oa3WYh-H_B4kh84XOoqRKrUmMm2A@mail.gmail.com>
Message-ID: <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net>

On 3/2/23 8:04 PM, Dan Cross wrote:
> I guess what I'm saying is, match what you want to match and don't sweat 
> the small stuff.

ACK

> Not exactly. :-)
> 
> What I understand you to mean, based on this and the rest of your note, 
> is that you want to find a good division point between overly specific, 
> complex REs and simpler, easy to understand REs that are less specific. 
> The danger with the latter is that they may match things you don't 
> intend, while the former are harder to maintain and (arguably) more 
> brittle. I can sympathize.

You got it.

> For the purposes of grep/egrep, that'll be a logical "line" of text, 
> terminated by a newline, though the newline itself isn't considered part 
> of the text for matching. I believe the `-z` option can be used to set a 
> NUL byte as the "line" terminator; presumably this lets one match 
> strings with embedded newlines, though I haven't tried.

Fair enough.  That's also sort of what I thought might be the case.

> "string" in this context is the input you're attempting to match 
> against. `egrep` will attempt to match your pattern against each "line" 
> of text it reads from the files it's searching. That is, each line in 
> your log file(s).

*nod*

> But consider what `[ :[:digit:]]{11}` means: you've got a character 
> class consisting of space, colon and a digit; {11} means "match any of 
> the characters in that class exactly 11 times" (as opposed to other 
> variations on the '{}' syntax that say "at least m times", "at most n 
> times", or "between n and m times").

Yep, I'm well aware of that.

> But that'll match all sorts of things that don't look like 'dd 
> hh:mm:ss':

That's one of the reasons that I'm interested in coming up with a more 
precise regular expression ... without being overly complex.

> (The first line is my typing; the second is output from egrep except for 
> the short line of 9 '1's, for which egrep had no output. The last two 
> lines are matching space characters and egrep echoing the match, but I'm 
> guessing gmail will eat those.)
> 
> Note that there are inputs with more than 11 characters that match; this 
> is because there is some 11-character substring that matches the RE  in 
> those lines. In any event, I suspect this would generally not be what 
> you want. But if nothing else in your input can match the RE (which you 
> might know a priori because of domain knowledge about whatever is 
> generating those logs) then it's no big deal, even if the RE was capable 
> of matching more things generally.

Yep.

Here's an example of the full RE:

^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$

As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a 
larger RE and there is bounding & delimiting around the subpart.

This is to match a standard message from postfix via standard SYSLOG.

> Ah. I suspect this relies on domain knowledge about the format of log 
> lines to match reliably. Otherwise it could match, `___ 123 456:789` 
> which is probably not what you are expecting.

Yep.

Though said domain knowledge isn't anything special in and of itself.

> Sure.  One nice thing about `egrep` et al is that you can put the REs 
> into a file and include them with `-f`, as opposed to having them all 
> directly on the command line.

Yep.  logcheck makes extensive use of many files like this to do its work.

> Typo.  :-)

ACK

> That seems reasonable.

Thank you for the logic CRC.

> Aside: I found the note on its website amusing: Brought to you by the 
> UK's best gambling sites! "Only gamble with what you can afford to 
> lose." Yikes!

Um ... that's concerning.

> I'd proceed with caution here; it also seems to be in the FreeBSD and 
> DragonFly ports collections and Homebrew on the Mac (but so is GNU grep 
> for all of those).

Fair enough.

My use case is on Linux where GNU egrep is a thing.

> Yeah. IMHO `\w` is too general for what you're trying to do.

I think that `\w` is a good primer, but not where I want things to end 
up long term.

> Basically, a regular expression is a regular expression if you can build 
> a machine with no additional memory that can tell you whether or not a 
> given string matches the RE examining its input one character at a time.

I /think/ that I could build a complex nested tree of switch statements 
to test each character to see if things match what they should or not. 
Though I would need at least one variable / memory to hold absolutely 
minimal state to know where I am in the switch tree.  I think a number 
to identify the switch statement in question would be sufficient.  So 
I'm guessing two bytes of variable and uncounted bytes of program code.

> I think that's about right.

Thank you again Dan.

> Sure thing!

:-)



-- 
Grant. . . .
unix || die


From ralph at inputplus.co.uk  Fri Mar  3 20:59:28 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Fri, 03 Mar 2023 10:59:28 +0000
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
Message-ID: <20230303105928.E88AB215AA@orac.inputplus.co.uk>

Hi Grant,

> What are the pros / cons to creating extended regular expressions like
> the following:

If you want to understand:

- the maths of regular expressions,
- the syntax of regexps which these days expresses more than REs, and
- the regexp engines in programs, the differences in how they work and
  what they match, and
- how to efficiently steer an engine's internals

then I recommend Jeffrey Friedl's Mastering Regular Expressions.
http://regex.info/book.html

> For matching patterns like the following in log files?
>
>     Mar  2 03:23:38

Do you want speed of matching with some false positives, or validation by
regexp rather than by post-lexing logic?  If the latter, to what depth,
e.g. does this month have a ‘31st’?

    /^... .. ..:..:../

You'd said egrep, which is NDFA, but in other engines, alternation order
can matter, e.g. ‘J’ starts the most months and some months have more
days than others.

    /^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|...

-- 
Cheers, Ralph.

From crossd at gmail.com  Fri Mar  3 23:11:23 2023
From: crossd at gmail.com (Dan Cross)
Date: Fri, 3 Mar 2023 08:11:23 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230303105928.E88AB215AA@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
Message-ID: <CAEoi9W7D49pdoKXMruSLUp7QO7468FkmwL2E715Nu=DfSHWhmQ@mail.gmail.com>

On Fri, Mar 3, 2023 at 5:59 AM Ralph Corderoy <ralph at inputplus.co.uk> wrote:
> [snip]
>
> If you want to understand:
>
> - the maths of regular expressions,
> - the syntax of regexps which these days expresses more than REs, and
> - the regexp engines in programs, the differences in how they work and
>   what they match, and
> - how to efficiently steer an engine's internals
>
> then I recommend Jeffrey Friedl's Mastering Regular Expressions.
> http://regex.info/book.html

I'm afraid I must sound a note of caution about Friedl's book.  Russ
Cox alludes to some of the problems in the "History and References"
section of his page (https://swtch.com/~rsc/regexp/regexp1.html), that
was linked earlier, and he links to this post:
http://regex.info/blog/2006-09-15/248

The impression is that Friedl shows wonderfully how to _use_ regular
expressions, but does not understand the theory behind their
implementation.

It is certainly true that today what many people refer to as "regular
expressions" are not in fact regular (and require a pushdown automaton
to implement, putting them somewhere between REs and the context-free
languages in terms of expressiveness).

Personally, I'd stick with Russ's stuff, especially as `egrep` is the
target here.

        - Dan C.

From ralph at inputplus.co.uk  Fri Mar  3 23:42:15 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Fri, 03 Mar 2023 13:42:15 +0000
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W7D49pdoKXMruSLUp7QO7468FkmwL2E715Nu=DfSHWhmQ@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <CAEoi9W7D49pdoKXMruSLUp7QO7468FkmwL2E715Nu=DfSHWhmQ@mail.gmail.com>
Message-ID: <20230303134215.3ED63215AA@orac.inputplus.co.uk>

Hi Dan,

> > If you want to understand:
> >
> > - the maths of regular expressions,
> > - the syntax of regexps which these days expresses more than REs, and
> > - the regexp engines in programs, the differences in how they work
> >   and what they match, and
> > - how to efficiently steer an engine's internals
> >
> > then I recommend Jeffrey Friedl's Mastering Regular Expressions.
> > http://regex.info/book.html
>
> I'm afraid I must sound a note of caution about Friedl's book.  Russ
> Cox alludes to some of the problems in the "History and References"
> section of his page (https://swtch.com/~rsc/regexp/regexp1.html), that
> was linked earlier

Russ says:

 1 ‘Finally, any discussion of regular expressions would be incomplete
    without mentioning Jeffrey Friedl's book Mastering Regular Expressions,
    perhaps the most popular reference among today's programmers.
 2  Friedl's book teaches programmers how best to use today's regular
    expression implementations, but not how best to implement them.
 3  What little text it devotes to implementation issues perpetuates the
    widespread belief that recursive backtracking is the only way to
    simulate an NFA.
 4  Friedl makes it clear that he [neither understands nor respects] the
    underlying theory.’  http://regex.info/blog/2006-09-15/248

I think Grant is after what Russ addresses in sentence 2.  :-)

> The impression is that Friedl shows wonderfully how to _use_ regular
> expressions, but does not understand the theory behind their
> implementation.

Yes, Friedl does show that wonderfully.  From long-ago memory, Friedl
understands enough to have diagrams of NFAs and DFAs clocking through
their inputs, showing the differences in number of states, etc.

Yes, Friedl says an NFA must recursively backtrack.  As Russ says in #3,
it was a ‘widespread belief’.  Friedl didn't originate it; I ‘knew’ it
before reading his book.  Friedl was at the sharp end of regexps,
needing to process large amounts of text, at Yahoo! IIRC.  He
investigated how the programs available behaved; he didn't start at the
theory and come up with a new program best suited to his needs.

> Personally, I'd stick with Russ's stuff, especially as `egrep` is the
> target here.

Russ's stuff is great.  He refuted that widespread belief, for one
thing.  But Russ isn't trying to teach a programmer how to best use the
regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes
the many pages needed to do this.

It depends what one wants to learn first.

As Friedl says in the post Russ linked to:

   ‘As a user, you don't care if it's regular, nonregular, unregular,
    irregular, or incontinent.  So long as you know what you can expect
    from it (something this chapter will show you), you know all you need
    to care about.

   ‘For those wishing to learn more about the theory of regular expressions,
    the classic computer-science text is chapter 3 of Aho, Sethi, and
    Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley,
    1986), commonly called “The Dragon Book” due to the cover design.
    More specifically, this is the “red dragon”.  The “green dragon”
    is its predecessor, Aho and Ullman's Principles of Compiler Design.’

In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory,
Languages, and Computation’ goes further into the subject.  Chapter two
has DFA, NFA, epsilon transitions, and uses searching text as an
example.  Chapter three is regular expressions, four is regular
languages.  Pushdown automata is chapter six.

Too many books, not enough time to read.  :-)

-- 
Cheers, Ralph.

From crossd at gmail.com  Fri Mar  3 23:47:39 2023
From: crossd at gmail.com (Dan Cross)
Date: Fri, 3 Mar 2023 08:47:39 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>
 <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
 <CAEoi9W4BjrcyEQdqUigfd+Oa3WYh-H_B4kh84XOoqRKrUmMm2A@mail.gmail.com>
 <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net>
Message-ID: <CAEoi9W5JBfNC8rEmcv=_fMvKRKcD1CocqnQzFv_oXjcKZ8M8zQ@mail.gmail.com>

On Thu, Mar 2, 2023 at 10:53 PM Grant Taylor via COFF <coff at tuhs.org> wrote:
> [snip]
> Here's an example of the full RE:
>
> ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
>
> As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
> larger RE and there is bounding & delimiting around the subpart.

Oh, for sure; to be clear, it was obvious that in the earlier
discussion the original was just part of something larger.

FWIW, this RE seems ok to me; the additional context makes it unlikely
to match something else accidentally.

> This is to match a standard message from postfix via standard SYSLOG.
>
> > Ah. I suspect this relies on domain knowledge about the format of log
> > lines to match reliably. Otherwise it could match, `___ 123 456:789`
> > which is probably not what you are expecting.
>
> Yep.
>
> Though said domain knowledge isn't anything special in and of itself.

It needn't be special.  The point is simply that there's some external
knowledge that can be brought to bear to guide the shape of the REs.
In this case, you know that log lines won't begin with `___ 123
456:789` or other similar junk.

> [snip]
> > Basically, a regular expression is a regular expression if you can build
> > a machine with no additional memory that can tell you whether or not a
> > given string matches the RE examining its input one character at a time.
>
> I /think/ that I could build a complex nested tree of switch statements
> to test each character to see if things match what they should or not.
> Though I would need at least one variable / memory to hold absolutely
> minimal state to know where I am in the switch tree.  I think a number
> to identify the switch statement in question would be sufficient.  So
> I'm guessing two bytes of variable and uncounted bytes of program code.

Kinda. The "machine" in this case is actually an abstraction, like a
Turing machine. The salient point here is that REs map to finite state
machines, and in particular, one need not keep (say) a stack of prior
states when simulating them. Note that even in an NDFA simulation,
where one keeps track of what states one may be in, one doesn't need
to keep track of how one got into those states.

Obviously in a real implementation you've got the program counter,
register contents, local variables, etc, all of which consume "memory"
in the conventional sense. But the point is that you don't need
additional memory proportional to anything other than the size of the
RE. A DFA could be implemented entirely with `switch` and `goto` if
one wanted, as opposed to a bunch of mutually recursive function
calls; NDFA simulation is similar, except that you need some (bounded)
additional memory to hold the active set of states. Contrast this
with a pushdown automaton, which can parse a context-free language by
maintaining a stack that can store additional information relative to
the input (for example, an already seen character). Pushdown automata
can, for example, recognize matched parentheses, while finite automata
cannot.
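
As a rough illustration of the `switch`-style approach, here is a sketch
of a hand-written DFA in POSIX shell for the toy RE
`^[0-9][0-9]:[0-9][0-9]$`. The function name `matches_hhmm` and the state
numbering are invented for this example; the only thing carried from one
character to the next is a single small integer:

```shell
# matches_hhmm STRING
# Succeeds if STRING matches ^[0-9][0-9]:[0-9][0-9]$ (hypothetical toy RE).
# Note: POSIX sh has no local variables, so these names are global.
matches_hhmm() {
    input=$1
    state=0                     # current DFA state; the only memory kept
    while [ -n "$input" ]; do
        rest=${input#?}         # input without its first character
        ch=${input%"$rest"}     # the first character itself
        input=$rest
        # One transition per (state, character) pair; anything else rejects.
        case "$state$ch" in
            0[0-9]) state=1 ;;
            1[0-9]) state=2 ;;
            2:)     state=3 ;;
            3[0-9]) state=4 ;;
            4[0-9]) state=5 ;;
            *)      return 1 ;;
        esac
    done
    # Accept only if the input ended in the final state.
    [ "$state" -eq 5 ]
}

matches_hhmm '12:34'  && echo 'match'
matches_hhmm '12:345' || echo 'no match'
```

There is no stack and no record of how a state was reached, which is the
property that separates this from a pushdown automaton.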

Anyway, sorry, this is all rather more theoretical than is perhaps
interesting or useful. Bottom line is, I think your REs are probably
fine. `egrep` will complain at you if they are not, and I wouldn't
worry too much about optimizing them: I'd "stop" whenever you're happy
that you've got something understandable that matches what you want it
to match.

        - Dan C.

From dave at horsfall.org  Sat Mar  4 02:12:31 2023
From: dave at horsfall.org (Dave Horsfall)
Date: Sat, 4 Mar 2023 03:12:31 +1100 (EST)
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230303105928.E88AB215AA@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
Message-ID: <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>

On Fri, 3 Mar 2023, Ralph Corderoy wrote:

> You'd said egrep, which is NDFA, but in other engines, alternation order 
> can matter, e.g. ‘J’ starts the most months and some months have more 
> days than others.
> 
>     /^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|...

I can't help but provide an extract from my antispam log summariser (AWK):

    # Yes, I have a warped sense of humour here.
    /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
    {
	date = sprintf("%4d/%.2d/%.2d",
	    year, months[substr($0, 1, 3)], substr($0, 5, 2))

Etc.  The idea is not to validate so much as to grab a line of interest to 
me and extract the bits that I want.

In this case I trust the source (the Sendmail log), but of course that is 
not always the case...

When doing things like this, you need to ask yourself at least the 
following questions:

1) What exactly am I trying to do?  This is fairly important :-)

2) Can I trust the data?  Bobby Tables, Reflections on Trusting Trust...

3) Etc.

And let's not get started on the difference betwixt "trusted" and 
"trustworthy" (that distinction keeps security bods awake at night).

-- Dave

From crossd at gmail.com  Sat Mar  4 03:13:13 2023
From: crossd at gmail.com (Dan Cross)
Date: Fri, 3 Mar 2023 12:13:13 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>
Message-ID: <CAEoi9W7hjvuTF=g-3B1E=1-z_o_nBJzre+LqGiKOdPpzna46vA@mail.gmail.com>

On Fri, Mar 3, 2023 at 11:12 AM Dave Horsfall <dave at horsfall.org> wrote:
> [snip]
>     # Yes, I have a warped sense of humour here.
>     /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
>     {
>         date = sprintf("%4d/%.2d/%.2d",
>             year, months[substr($0, 1, 3)], substr($0, 5, 2))

If I may, I'd like to point out something fairly subtle here that, I
think, bears on the original question (paraphrased as, "where does one
draw the line between concision and understandability?").

Note Dave's class to match the first letter of the month:
`[JFMAMJJASOND]`. One may notice that a few letters are repeated (J,
M, A), and one _could_ shorten this to: `[JFMASOND]`. But I can see a
serious argument where that may be regarded as a mistake; in
particular, the original is easy to validate by just saying the names
of the months out loud as one scans the list. For the shorter version,
I'd worry that I would miss something or make a mistake. The lesson
here is: keep it simple and don't over-optimize!

> Etc.  The idea is not to validate so much as to grab a line of interest to
> me and extract the bits that I want.
> [snip]

Too true.

A few years ago, Rob Pike gave a talk about lexing in Go that bears on
this that's worth a listen:
https://www.youtube.com/watch?v=HxaD_trXwRE

        - Dan C.

From ralph at inputplus.co.uk  Sat Mar  4 03:38:56 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Fri, 03 Mar 2023 17:38:56 +0000
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W7hjvuTF=g-3B1E=1-z_o_nBJzre+LqGiKOdPpzna46vA@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>
 <CAEoi9W7hjvuTF=g-3B1E=1-z_o_nBJzre+LqGiKOdPpzna46vA@mail.gmail.com>
Message-ID: <20230303173856.B615421D37@orac.inputplus.co.uk>

Hi,

> Dave Horsfall <dave at horsfall.org> wrote:
> >     # Yes, I have a warped sense of humour here.
> >     /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
...
> in particular, the original is easy to validate by just saying the
> names of the month out loud as one scans the list.

Some clients pay me to read code and find fault.  It's a hard habit to
break.  ‘coc’ smells wrong.  :-)

A bit of vi's :map later...

    Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc

The regexp works, of course, but in this case removing the redundancy
would also fix the ‘fault’.

-- 
Cheers, Ralph.

From coff at tuhs.org  Sat Mar  4 05:06:17 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Fri, 3 Mar 2023 12:06:17 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
Message-ID: <c4e87cee-d04f-50be-cb91-edf1b24eaf07@spamtrap.tnetconsulting.net>

Thank you all for very interesting and engaging comments & threads to 
chase / pull / untangle.

I'd like to expand / refine my original question a little bit.

On 3/2/23 11:54 AM, Grant Taylor via COFF wrote:
> I'd like some thoughts ~> input on extended regular expressions used 
> with grep, specifically GNU grep -E / egrep.

While doing some reading of the references that Clem provided I came 
across multiple indications that back-references can be problematic from 
a performance standpoint.

So I'd like to know if all back-references are problematic, or if very 
specific back-references are okay.

Suppose I have the following two lines:

    aaa aaa
    aaa bbb

Does the following RE w/ back-reference introduce a big performance penalty?

    (aaa|bbb) \1

As in:

    % echo "aaa aaa" | egrep "(aaa|bbb) \1"
    aaa aaa

I can easily see how a back reference to something that is not a fixed 
length can become a rabbit hole.  But I'm wondering if a back-reference 
to -- what I think is called -- an alternation (with length fixed in the 
RE) is a performance hit or not.

Now to read and reply to the many good comments that people have shared. 
  :-)



-- 
Grant. . . .
unix || die


From crossd at gmail.com  Sat Mar  4 05:09:41 2023
From: crossd at gmail.com (Dan Cross)
Date: Fri, 3 Mar 2023 14:09:41 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230303173856.B615421D37@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>
 <CAEoi9W7hjvuTF=g-3B1E=1-z_o_nBJzre+LqGiKOdPpzna46vA@mail.gmail.com>
 <20230303173856.B615421D37@orac.inputplus.co.uk>
Message-ID: <CAEoi9W4vZ+_hkPYL+sD7zFwmMoBPm1zez76pPQVDjARB4gR3oQ@mail.gmail.com>

On Fri, Mar 3, 2023 at 12:39 PM Ralph Corderoy <ralph at inputplus.co.uk> wrote:

> > Dave Horsfall <dave at horsfall.org> wrote:
> > >     # Yes, I have a warped sense of humour here.
> > >     /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
> ...
> > in particular, the original is easy to validate by just saying the
> > names of the month out loud as one scans the list.
>
> Some clients pay me to read code and find fault.  It's a hard habit to
> break.  ‘coc’ smells wrong.  :-)
>
> A bit of vi's :map later...
>
>     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc
>
> The regexp works, of course, but in this case removing the redundancy
> would also fix the ‘fault’.

Ha! Good catch.  I'd probably just write it as,

`(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)` which isn't much
longer than the original anyway.

        - Dan C.

From coff at tuhs.org  Sat Mar  4 05:19:29 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Fri, 3 Mar 2023 12:19:29 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230303134215.3ED63215AA@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <CAEoi9W7D49pdoKXMruSLUp7QO7468FkmwL2E715Nu=DfSHWhmQ@mail.gmail.com>
 <20230303134215.3ED63215AA@orac.inputplus.co.uk>
Message-ID: <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net>

On 3/3/23 6:42 AM, Ralph Corderoy wrote:
> I think Grant is after what Russ addresses in sentence 2.  :-)

You are mostly correct.  The motivation for this thread is very much 
wanting to learn "how best to use today's regular expression 
implementations".  However, there is also the part of me that wants to 
have a little bit of understanding of why the former is the case.

> Yes, Friedl does show that wonderfully.  From long-ago memory, Friedl
> understands enough to have diagrams of NFAs and DFAs clocking through
> their inputs, showing the differences in number of states, etc.

It seems like I need to find another copy of Friedl's book.  --  My 
current copy is boxed up for a move nearly 1k miles away.  :-/

> Yes, Friedl says an NFA must recursively backtrack.  As Russ says in #3,
> it was a ‘widespread belief’.  Friedl didn't originate it; I ‘knew’ it
> before reading his book.  Friedl was at the sharp end of regexps,
> needing to process large amounts of text, at Yahoo! IIRC.  He
> investigated how the programs available behaved; he didn't start at the
> theory and come up with a new program best suited to his needs.

It sounds like I'm coming from a similar position of "what is the best* 
way to process this corpus" more than "what is the underlying theory 
behind what I'm wanting to do".

> Russ's stuff is great.  He refuted that widespread belief, for one
> thing.  But Russ isn't trying to teach a programmer how to best use the
> regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes
> the many pages needed to do this.

:-)

> It depends what one wants to learn first.

I'm learning that I'm more of a technician who wants to know how to use 
the existing tools to the best of his / their ability, while having 
some interest in the theory behind things.

> As Friedl says in the post Russ linked to:
> 
>     ‘As a user, you don't care if it's regular, nonregular, unregular,
>      irregular, or incontinent.  So long as you know what you can expect
>      from it (something this chapter will show you), you know all you need
>      to care about.

Yep.  That's the position that I would be in if someone were paying me 
to write the REs that I'm writing.

>     ‘For those wishing to learn more about the theory of regular expressions,
>      the classic computer-science text is chapter 3 of Aho, Sethi, and
>      Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley,
>      1986), commonly called “The Dragon Book” due to the cover design.
>      More specifically, this is the “red dragon”.  The “green dragon”
>      is its predecessor, Aho and Ullman's Principles of Compiler Design.’

This all sounds interesting to me, and like something I might add to my 
collection of books.  But it also sounds like something that will be an 
uphill read and a vast learning opportunity.

> In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory,
> Languages, and Computation’ goes further into the subject.  Chapter two
> has DFA, NFA, epsilon transitions, and uses searching text as an
> example.  Chapter three is regular expressions, four is regular
> languages.  Pushdown automata is chapter six.
> 
> Too many books, not enough time to read.  :-)

Yep.  Even inventorying and keeping track of the books can be time 
consuming.  --  Thankfully I took some time to do exactly that and have 
access to that information on the super computer in my pocket.



-- 
Grant. . . .
unix || die


From coff at tuhs.org  Sat Mar  4 05:26:41 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Fri, 3 Mar 2023 12:26:41 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W5JBfNC8rEmcv=_fMvKRKcD1CocqnQzFv_oXjcKZ8M8zQ@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>
 <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net>
 <CAEoi9W4BjrcyEQdqUigfd+Oa3WYh-H_B4kh84XOoqRKrUmMm2A@mail.gmail.com>
 <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net>
 <CAEoi9W5JBfNC8rEmcv=_fMvKRKcD1CocqnQzFv_oXjcKZ8M8zQ@mail.gmail.com>
Message-ID: <9bb089cd-1317-6bcf-3bd3-231ce96b333c@spamtrap.tnetconsulting.net>

On 3/3/23 6:47 AM, Dan Cross wrote:
> Oh, for sure; to be clear, it was obvious that in the earlier 
> discussion the original was just part of something larger.

Good.  For a moment I thought that you might be thinking it was stand alone.

> FWIW, this RE seems ok to me; the additional context makes it unlikely 
> to match something else accidentally.

:-)

> It needn't be special.  The point is simply that there's some external 
> knowledge that can be brought to bear to guide the shape of the REs.

ACK

I've heard "domain (specific) knowledge" used to refer to both extremely 
specific training in a field and -- as you have -- data that is having 
something done to it.

> In this case, you know that log lines won't begin with `___ 123 
> 456:789` or other similar junk.

They darned well had better not.

> Kinda. The "machine" in this case is actually an abstraction, like a 
> Turing machine. The salient point here is that REs map to finite state 
> machines, and in particular, one need not keep (say) a stack of prior 
> states when simulating them. Note that even in an NDFA simulation, 
> where one keeps track of what states one may be in, one doesn't need 
> to keep track of how one got into those states.

ACK

> Obviously in a real implementation you've got the program counter, 
> register contents, local variables, etc, all of which consume 
> "memory" in the conventional sense. But the point is that you don't 
> need additional memory proportional to anything other than the size 
> of the RE. DFA implementation could be implemented entirely with 
> `switch` and `goto` if one wanted, as opposed to a bunch of mutually 
> recursive function calls, NDFA simulation similarly except that 
> you need some (bounded) additional memory to hold the active set 
> of states. Contrast this with a pushdown automata, which can parse 
> a context-free language, in which a stack is maintained that can 
> store additional information relative to the input (for example, 
> an already seen character). Pushdown automata can, for example, 
> recognize matched parenthesis while regular languages cannot.

I think I understand the gist of what you're saying, but I need to 
re-read it and think about it a little bit.

> Anyway, sorry, this is all rather more theoretical than is perhaps 
> interesting or useful.

Apology returned to sender as unnecessary.

You are providing the requested thought provoking discussion, which is 
exactly what I asked for.  I feel like I'm going to walk away from this 
thread wiser based on the thread's content plus all additional reading 
material on top of the thread itself.

> Bottom line is, I think your REs are probably fine. `egrep` will 
> complain at you if they are not, and I wouldn't worry too much about 
> optimizing them: I'd "stop" whenever you're happy that you've got 
> something understandable that matches what you want it to match.

Thank you (again) Dan.  :-)



-- 
Grant. . . .
unix || die


From crossd at gmail.com  Sat Mar  4 05:31:14 2023
From: crossd at gmail.com (Dan Cross)
Date: Fri, 3 Mar 2023 14:31:14 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <c4e87cee-d04f-50be-cb91-edf1b24eaf07@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <c4e87cee-d04f-50be-cb91-edf1b24eaf07@spamtrap.tnetconsulting.net>
Message-ID: <CAEoi9W45t__mk_MH_2VOn=exhuRhTwE4XdnQCug9Jb0fPVWLOQ@mail.gmail.com>

On Fri, Mar 3, 2023 at 2:06 PM Grant Taylor via COFF <coff at tuhs.org> wrote:
> Thank you all for very interesting and engaging comments & threads to
> chase / pull / untangle.
>
> I'd like to expand / refine my original question a little bit.
>
> On 3/2/23 11:54 AM, Grant Taylor via COFF wrote:
> > I'd like some thoughts ~> input on extended regular expressions used
> > with grep, specifically GNU grep -e / egrep.
>
> While some reading of the references that Clem provided I came across
> multiple indications that back-references can be problematic from a
> performance stand point.
>
> So I'd like to know if all back-references are problematic, or if very
> specific back-references are okay.

The thing about backreferences is that they're not representable in
the regular languages because they require additional state (the thing
the backref refers to), so you cannot construct a DFA corresponding to
them, nor an NDFA simulator (this is where Friedl gets things wrong!);
you really need a pushdown automaton and then you're in the domain of
the context-free languages. Therefore, "regexps" that use back
references are not actually regular expressions.

Yet, popular engines support them...but how? Well, pretty much all of
them use a backtracking implementation, which _can_ be exponential in
both time and space.

Now, that said, there are plenty of REs, even some with backrefs,
that'll execute plenty fast enough on backtracking implementations; it
really depends on the expressions in question and the size of strings
you're trying to match against. But you lose the bounding guarantees
DFAs and NDFAs provide.
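
To make the bounding guarantee concrete, here is a minimal sketch of
an NDFA simulation in Python.  The automaton is hand-built for the
textbook expression (a|b)*abb; the table, the state numbers and the
function name are assumptions made up for this illustration, not
anything taken from a real engine.  The only thing carried from one
input character to the next is the set of active states, and that set
can never be larger than the number of states in the automaton.

```python
# Hand-built NDFA for (a|b)*abb.  States are 0..3; state 3 accepts.
# The table and names are illustrative assumptions.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, 3


def nfa_match(text):
    """Return (matched, peak) for a whole-string match of text.

    peak is the largest number of simultaneously active states seen,
    which is bounded by the size of the automaton, not by len(text).
    """
    active = {START}
    peak = len(active)
    for ch in text:
        # Advance every active state on ch; no history is kept.
        active = {nxt for s in active for nxt in NFA.get((s, ch), ())}
        peak = max(peak, len(active))
    return ACCEPT in active, peak
```

Calling nfa_match("ab" * 5000 + "b") walks 10001 characters while
never holding more than a handful of states, which is the property a
backtracking matcher cannot promise.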

> Suppose I have the following two lines:
>
>     aaa aaa
>     aaa bbb
>
> Does the following RE w/ back-reference introduce a big performance penalty?
>
>     (aaa|bbb) \1
>
> As in:
>
>     % echo "aaa aaa" | egrep "(aaa|bbb) \1"
>     aaa aaa
>
> I can easily see how a back reference to something that is not a fixed
> length can become a rabbit hole.  But I'm wondering if a back-reference
> to -- what I think is called -- an alternation (with length fixed in the
> RE) is a performance hit or not.

Well, it's more about the implementation strategy than the specific
expression here. Could this become exponential? I don't think this one
would, no; but others may, particularly if you use Kleene closures in
the alternation.
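
For this particular expression the backreference can be removed by
hand, because the group can only ever capture one of two fixed
strings.  The sketch below uses Python's re module purely as a
convenient way to compare the two spellings; Python's engine is
itself a backtracking one, so this shows equivalence of the matches,
not a speed-up.  The variable names are made up for the example.

```python
import re

# Backreference form: not a regular expression in the formal sense.
with_backref = re.compile(r"(aaa|bbb) \1")

# Backreference-free form: spell out each string the group can capture,
# followed by its repeat.  This form is a true regular expression.
expanded = re.compile(r"aaa aaa|bbb bbb")

lines = ["aaa aaa", "aaa bbb", "bbb bbb", "xyzaaa aaaxyz"]

matches_backref = [s for s in lines if with_backref.search(s)]
matches_expanded = [s for s in lines if expanded.search(s)]
```

Both lists contain the same three lines; only "aaa bbb" is rejected.
The expansion stops being practical once the group can capture an
unbounded variety of strings.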

This _is_ something that appears in the wild, by the way, not just in
theory; I did a change to Google's spelling service code to replace
PCRE with re2 precisely because it was blowing up with exponential
memory usage on some user input. The problems went away, but I had to
rewrite a bunch of the REs involved.

        - Dan C.

From coff at tuhs.org  Sat Mar  4 05:36:35 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Fri, 3 Mar 2023 12:36:35 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>
Message-ID: <8648a720-62a6-1ed2-b0ba-2dcc38097da6@spamtrap.tnetconsulting.net>

On 3/3/23 9:12 AM, Dave Horsfall wrote:
> I can't help but provide an extract from my antispam log summariser 
> (AWK):
> 
>      # Yes, I have a warped sense of humour here.
>      /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \
>      {
> 	date = sprintf("%4d/%.2d/%.2d",
> 	    year, months[substr($0, 1, 3)], substr($0, 5, 2))

Thank you for sharing that Dave.

> Etc.  The idea is not to validate so much as to grab a line of interest 
> to me and extract the bits that I want.

Fair enough.

Using bracket expressions for the three letters is definitely another 
idea that I hadn't considered.

But I believe I like what I think is -- what I'm going to describe as -- 
the more precise alternation listing out each month. (Jan|Feb|Mar...

Such an alternation is not going to match Jer like the three bracket 
expressions will.  I also believe that the alternation will be easier to 
maintain in the future.  Especially by someone other than me that has 
less experience with REs.
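
The difference is easy to check.  The following is a small sketch in
Python rather than egrep, assuming the syslog-style line prefix used
elsewhere in this thread; the sample lines are invented for the test.

```python
import re

# The three bracket expressions, as in Dave's AWK pattern.
bracket = re.compile(r"^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] ")

# The explicit alternation of month abbreviations.
alternation = re.compile(
    r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
)

samples = [
    "Mar  2 03:23:38",
    "Jer  2 03:23:38",
    "Dot  2 03:23:38",
    "Dcc  2 03:23:38",
]

bracket_hits = [s for s in samples if bracket.search(s)]
alternation_hits = [s for s in samples if alternation.search(s)]
```

All four samples land in bracket_hits, while alternation_hits holds
only the "Mar" line.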

> In this case I trust the source (the Sendmail log), but of course 
> that is not always the case...

I trust that syslog will produce consistent line beginnings more than I 
trust the data that is provided to syslog.  But I'd still like to be 
able to detect "Jer" or "Dot" if syslog ever tosses its cookies.

> When doing things like this, you need to ask yourself at least the 
> following questions:
> 
> 1) What exactly am I trying to do?  This is fairly important :-)

Filter out known to be okay log entries.

> 2) Can I trust the data?  Bobby Tables, Reflections on Trusting 
> Trust...

Given that I'm effectively negating things and filtering out log entries 
that I want to not see (because they are okay) I'm comfortable with 
trusting the data from syslog.

Brown M&Ms come to mind.

> 3) Etc.
> 
> And let's not get started on the difference betwixt "trusted" and 
> "trustworthy" (that distinction keeps security bods awake at night).

ACK



-- 
Grant. . . .
unix || die


From ralph at inputplus.co.uk  Sat Mar  4 20:07:17 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 04 Mar 2023 10:07:17 +0000
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <c4e87cee-d04f-50be-cb91-edf1b24eaf07@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <c4e87cee-d04f-50be-cb91-edf1b24eaf07@spamtrap.tnetconsulting.net>
Message-ID: <20230304100717.E8F882021A@orac.inputplus.co.uk>

Hi Grant,

> Suppose I have the following two lines:
>
>     aaa aaa
>     aaa bbb
>
> Does the following RE w/ back-reference introduce a big performance
> penalty?
>
>     (aaa|bbb) \1
>
> As in:
>
>     % echo "aaa aaa" | egrep "(aaa|bbb) \1"
>     aaa aaa

You could measure the number of CPU instructions and experiment.

    $ echo xyzaaa aaaxyz >f
    $ ticks() { LC_ALL=C perf stat -e instructions egrep "$@"; }
    $
    $ ticks '(aaa|bbb) \1' <f
    xyzaaa aaaxyz

     Performance counter stats for 'egrep (aaa|bbb) \1':

	       2790889      instructions:u                                              

	   0.009146904 seconds time elapsed

	   0.009178000 seconds user
	   0.000000000 seconds sys


    $

Bear in mind that egreps differ, even within GNU egrep, say, over time.

    $ LC_ALL=C perf stat -e instructions egrep '(aaa|bbb) \1' f
    xyzaaa aaaxyz
    ...
	       2795836      instructions:u                                              
    ...
    $ LC_ALL=C perf stat -e instructions perl -ne '/(aaa|bbb) \1/ and print' f
    xyzaaa aaaxyz
    ...
	       2563488      instructions:u                                              
    ...
    $ LC_ALL=C perf stat -e instructions sed -nr '/(aaa|bbb) \1/p' f
    xyzaaa aaaxyz
    ...
		610213      instructions:u                                              
    ...
    $

-- 
Cheers, Ralph.

From ralph at inputplus.co.uk  Sat Mar  4 20:15:33 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 04 Mar 2023 10:15:33 +0000
Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on
 extended regular expressions in grep.)
In-Reply-To: <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <CAEoi9W7D49pdoKXMruSLUp7QO7468FkmwL2E715Nu=DfSHWhmQ@mail.gmail.com>
 <20230303134215.3ED63215AA@orac.inputplus.co.uk>
 <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net>
Message-ID: <20230304101533.D9CCF2021A@orac.inputplus.co.uk>

Hi,

Grant wrote:
> Even inventorying and keeping track of the books can be time
> consuming.  --  Thankfully I took some time to do exactly that and
> have access to that information on the super computer in my pocket.

I seek recommendations for an Android app to comfortably read PDFs on a
mobile phone's screen.  They were intended to be printed as a book.  In
particular, once I've zoomed and panned to get the interesting part of a
page as large as possible, swiping between pages should persist that
view.  An extra point for allowing odd and even pages to use different
panning.

-- 
Cheers, Ralph.

From ralph at inputplus.co.uk  Sat Mar  4 20:26:51 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 04 Mar 2023 10:26:51 +0000
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <8648a720-62a6-1ed2-b0ba-2dcc38097da6@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <alpine.BSF.2.21.9999.2303040254050.4881@aneurin.horsfall.org>
 <8648a720-62a6-1ed2-b0ba-2dcc38097da6@spamtrap.tnetconsulting.net>
Message-ID: <20230304102651.D73622021A@orac.inputplus.co.uk>

Hi Grant,

> the more precise alternation listing out each month. (Jan|Feb|Mar...

For those regexp engines which test each alternative in turn, ordering
the months most-frequent first would give a slight win.  :-)  It really
is a rabbit hole once you start.  Typically not worth entering, but it
can be fun if you like that kind of thing.

> I trust that syslog will produce consistent line beginnings more than
> I trust the data that is provided to syslog.  But I'd still like to be
> able to detect "Jer" or "Dot" if syslog ever tosses its cookies.

You could develop your regexps to find lines of interest and then flip
them about, e.g. egrep's -v, to see what lines are missed and consider
if any are interesting.  Repeat.  But this happens at development time.

Or at run time, you can have a ‘loose’ regexp to let all expected lines
in through the door and then match with one or more ‘tight’ regexps,
baulking if none do.
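
A minimal sketch of that run-time arrangement, in Python for
brevity.  The two patterns and the three labels are assumptions for
illustration only; a real set-up would have many tight regexps, one
per known-good kind of line.

```python
import re

# Loose: anything shaped like a syslog timestamp gets through the door.
LOOSE = re.compile(r"^[A-Z][a-z]{2} [ 0-9][0-9] ")

# Tight: only real month names and plausible days of the month.
TIGHT = [
    re.compile(
        r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
        r"( [1-9]|[12][0-9]|3[01]) "
    ),
]


def classify(line):
    """Label a line 'ignored', 'ok', or 'suspect'."""
    if not LOOSE.search(line):
        return "ignored"      # not the kind of line we look at
    if any(t.search(line) for t in TIGHT):
        return "ok"           # expected shape, recognised in detail
    return "suspect"          # expected shape, but nothing tight matched
```

A line such as "Jer  2 03:23:38 host sshd[1]: hello" comes back as
'suspect' instead of being silently skipped.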

There's no right answer in general.

-- 
Cheers, Ralph.

From ralph at inputplus.co.uk  Sun Mar  5 01:15:53 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 04 Mar 2023 15:15:53 +0000
Subject: [COFF] A second Unix Patent
In-Reply-To: <202303041123.324BND9W061456@ultimate.com>
References: <20230304015746.DD95518C08D@mercury.lcs.mit.edu>
 <20230304092216.287E22020E@orac.inputplus.co.uk>
 <202303041123.324BND9W061456@ultimate.com>
Message-ID: <20230304151553.AD3EC210F2@orac.inputplus.co.uk>

Hi Phil,

Copying to the COFF list, hope that's okay.  I thought it might interest
them.

> >     $ units -1v '26^3 16 bit' 64KiB
>
> Works only for GNU units.

That's interesting, thanks.

I've access to a FreeBSD 12.3-RELEASE-p6, if that version number means
something to you.  Its units groks ^ to mean power when applied to a
unit, as the fine units(1) says, but not to a number.  Whereas * works.

    $ units yd^3 ft^3
            * 27
            / 0.037037037
    $ 
    $ units 6\*7 21
            * 2
            / 0.5
    $ 
    $ units 2^4 64 
            * 0.03125
            / 32
    $ 

The last one silently treats 2^4 as 2; I'd say that's a bug.

It has Ki- and byte allowing

    $ units -t Kibyte bit
    8192

but lacks GNU's

    B   byte

Fair enough, though I think that's common enough now to be included.

FreeBSD also seems to have another bug: demanding a space between the
quantity and the unit for fundamental ‘!’ units.

    $ units m 8m
    conformability error
	    1 m
	    8
    $ units m '8 m'
	    * 0.125
	    / 8
    $

I found this when attempting the obvious

    $ units Kibyte 8bit
    conformability error
	    8192 bit
	    8
    $ units Kibyte '8 bit'
	    * 1024
	    / 0.0009765625
    $

Whilst I'm not a GNU acolyte, in this case its version of units does
seem to have had a bit more TLC.  :-)

-- 
Cheers, Ralph.

From egbegb2 at gmail.com  Mon Mar  6 20:01:47 2023
From: egbegb2 at gmail.com (Ed Bradford)
Date: Mon, 6 Mar 2023 04:01:47 -0600
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
Message-ID: <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>

  Thanks, Grant and contributors in
this thread,

Great thread on RE's. I bought and read
the book (it's on the floor over there
in the corner and I'm not getting up).

My task was finding dates in binary
and text files. It turns out RE's work just
fine for that. Because I was looking at
both text files and binary files, I
wrote my stuff using 8-bit python
"bytes" rather than python "text" which
is, I think, 7-bit in python. (I use
python because it works on
Linux, Macs, and Windows and reduces the
number of RE implementations I have
to deal with to 1).

I finished my first round of the
program late fall of 2022. Then
I put it down and now I am
revisiting it. I was creating:

  A Python program to search for
  media files (pictures and movies)
  and copy them to another
  directory tree, copying only the
  unique ones (deduplication), and
  renaming each with

    *YYYY-MM-DD-*

  as a prefix.


Here is a list of observations from my
programming.

1. RE's are quite unreadable. I defined
   a lot of python variables and simply
   added them together in python to make
   a larger byte string (see below).
   The resulting
   expressions were shorter on screen
   and more readable. Furthermore,
   I could construct them incrementally.
   I insist on readable code
   because I frequently put things down
   for a month or more. A while back
   it was a sad day when I restarted
   something and simply had to throw it
   away, moaning, "What was that
   programmer thinking?".

   Here is an example RE for
       YYYY-MM-DD

      # FR = front   BA = back
      # ymdt is text version
      ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
      ymdc = re.compile( ymdt )


1a. I also had a time defining
    delimiters. There are delimiters
    for the beginning, delimiters
    for internal separation,
    and delimiters for the end.

    The significant thing is I have
    to find the RE if it is the very
    first string in the file or the
    very last. That also complicates
    buffered reading immensely. Hence, I wrote
    the whole program by reading the
    file into a single python variable.
    However, when files become much
    larger than memory, python simply
    ground to a halt as did my Windows
    machine. I then rewrote it using a
    memory mapped file (for all files)
    and the problem was fixed.

2. Dates are formatted in a number of
   ways. I chose exactly one
   format to learn about RE's
   and how to construct them and use
   them. Even the book didn't elaborate
   everything. I could not find
   detailed documentation on some of
   the interfaces in the book.

   On a whim, I asked chatGPT
   to write a python module that returns
   a list of offsets and dates in a file.
   Surprisingly, it wrote one that was
   quite credible. It had bugs but it
   knew more about how to use the various
   functional interfaces in RE's than I
   did.

3. Testing an RE is maybe even more
   difficult than writing one. I have
   not given any serious effort to
   verification testing yet.
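
As an illustration of observations 1 and 1a, here is a sketch of how
such a composed bytes pattern and a memory-mapped search might look.
The definitions of FRSEP, SEP, BASEP, Y_, M_ and D_ are hypothetical;
the real ones are not shown above.  This sketch assumes lookarounds
for the front and back delimiters, so a date at the very start or
very end of the file still matches, and it assumes the file is not
empty, since an empty file cannot be memory mapped.

```python
import mmap
import re

# Hypothetical building blocks; the actual definitions are not shown.
SEP = rb"[-/.]"                  # internal separator
FRSEP = rb"(?<![0-9])"           # front: not preceded by a digit
BASEP = rb"(?![0-9])"            # back: not followed by a digit
Y_ = rb"(?:19|20)[0-9]{2}"
M_ = rb"(?:0[1-9]|1[0-2])"
D_ = rb"(?:0[1-9]|[12][0-9]|3[01])"

# ymdt is the text (bytes) version, ymdc the compiled version.
ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
ymdc = re.compile(ymdt)


def find_dates(path):
    """Return a list of (offset, date string) found in the file."""
    with open(path, "rb") as fh:
        # Map the whole file read-only; the OS pages it in on demand.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [
                (m.start(), m.group(0).decode("ascii"))
                for m in ymdc.finditer(mm)
            ]
```

Note that this SEP accepts mixed separators such as 2023-03/06; a
stricter version would need one alternative per separator.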

I would like to extend my program to
any date format. That would require
a much bigger RE. I have been led to
believe that a 50Kbyte or 500Kbyte
RE works just as well (if not
as fast) as a 100 byte RE. I think
with parentheses and
pipe-symbols suitably used,
one could match

  Monday, March 6, 2023
  2023-03-06
  Mar 6, 2023
  or
  ...

I'm just guessing, though. This
thread has been very informative.
I have much to read.
Thank all of you.

Ed Bradford
Pflugerville, TX




On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF <coff at tuhs.org> wrote:

> Hi,
>
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
>     ^\w{3}
>
> vs:
>
>     ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
>
> Or:
>
>     [ :[:digit:]]{11}
>
> vs:
>
>     ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
>
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
>
> For matching patterns like the following in log files?
>
>     Mar  2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries.  So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly.  As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured.  The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
>
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
>
>
>
> --
> Grant. . . .
> unix || die
>
>

-- 
Advice is judged by results, not by intentions.
  Cicero

From crossd at gmail.com  Tue Mar  7 07:01:51 2023
From: crossd at gmail.com (Dan Cross)
Date: Mon, 6 Mar 2023 16:01:51 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
Message-ID: <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>

On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2 at gmail.com> wrote:
>[snip]
> I would like to extend my program to
> any date format. That would require
> a much bigger RE. I have been led to
> believe that a 50Kbyte or 500Kbyte
> RE works just as well (if not
> as fast) as a 100 byte RE. I think
> with parentheses and
> pipe-symbols suitably used,
> one could match
>
>   Monday, March 6, 2023
>   2023-03-06
>   Mar 6, 2023
>   or
>   ...

This reminds me of something that I wanted to bring up.

Perhaps one _could_ define a sufficiently rich regular expression that
one could match a number of date formats. However, I submit that one
_should not_. REs may be sufficiently powerful, but in all likelihood
what you'll end up with is an unreadable mess; it's like people who
abuse `sed` or whatever to execute complex, general purpose programs:
yeah, it's a clever hack, but that doesn't mean you should do it.

Pick the right tool for the job. REs are a powerful tool, but they're
not the right tool for _every_ job, and I'd argue that once you hit a
threshold of complexity that'll be mostly self-evident, it's time to
move on to something else.
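
One possible "something else" for the three formats quoted above is
a short list of strptime formats tried in turn.  This is only a
sketch: the FORMATS tuple and the parse_date name are assumptions
for the example, it expects the candidate date string to have been
isolated already, and the month and weekday names depend on the
current locale (English names are assumed here).

```python
from datetime import datetime

# One entry per supported layout; extend the tuple, not a regexp.
FORMATS = (
    "%A, %B %d, %Y",   # Monday, March 6, 2023
    "%Y-%m-%d",        # 2023-03-06
    "%b %d, %Y",       # Mar 6, 2023
)


def parse_date(text):
    """Return a datetime.date for the first format that fits, else None."""
    candidate = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(candidate, fmt).date()
        except ValueError:
            continue   # this layout did not fit; try the next one
    return None
```

Each new date layout is then one more readable line, and a string
that fits none of them yields None rather than a partial match.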

As for large vs small REs.... When we start talking about differences
of orders of magnitude in size, we start talking about real
performance implications; in general an NDFA simulation of a regular
expression will have on the order of the length of the RE in states,
so when the length of the RE is half a million symbols, that's
half a million states, which, even though it's bounded, is still a
pretty big number in practice, even on modern CPUs.

I wouldn't want to poke that bear.

        - Dan C.

From steffen at sdaoden.eu  Tue Mar  7 07:49:05 2023
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Mon, 06 Mar 2023 22:49:05 +0100
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
Message-ID: <20230306214905.vK5oe%steffen@sdaoden.eu>

Dan Cross wrote in
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw at mail.gmail.com>:
 |On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2 at gmail.com> wrote:
 |>[snip]
 |> I would like to extend my program to
 |> any date format. That would require
 |> a much bigger RE. I have been led to
 ...
 |> one could match
 |>
 |>   Monday, March 6, 2023
 |>   2023-03-06
 |>   Mar 6, 2023
 |>   or
 ...
 |This reminds me of something that I wanted to bring up.

Me too.  If it becomes something regular and stable, maybe turn it
into a dedicated parser.  (I say that as someone who refuses lex, yacc,
bison and byacc, but these surely can do it too.)

 |Perhaps one _could_ define a sufficiently rich regular expression that
 |one could match a number of date formats. However, I submit that one
 |_should not_. REs may be sufficiently powerful, but in all likelihood
 ...

Kurt Shoens implemented a date template parser for BSD Mail in
about 1980, which was changed many years later, in 1988, by
Edward Wang ([1] commit
[309eb459e35f77985851ce143ad2f9da5f0d90da], 1988-07-08 18:41:33
-0800).  There is strftime(3), but it came to CSRG later than
both, and the Wang code (in usr.bin/mail/head.c) is a dedicated
thing.

(I.e.
  /* Template characters for cmatch_data.tdata:
   * 'A'   An upper case char
   * 'a'   A lower case char
   * ' '   A space
   * '0'   A digit
   * 'O'   An optional digit or space; MUST be followed by '0space'!
   * ':'   A colon
   * '+'  Either a plus or a minus sign */

and then corresponding strings like "Aaa Aaa O0 00:00:00 0000".)
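
A minimal sketch of how such a template could be matched, written in
Python rather than the original C.  This is not the head.c code: the
function name cmatch is borrowed from the comment above, the
equal-length rule is a simplification, and 'O' is treated here as
"one character that is a digit or a space".

```python
import string

def cmatch(text, template):
    """Return True if text fits the template, character by character."""
    # Simplifying assumption: text and template must have equal length.
    if len(text) != len(template):
        return False
    for ch, t in zip(text, template):
        if t == "A":
            ok = ch in string.ascii_uppercase
        elif t == "a":
            ok = ch in string.ascii_lowercase
        elif t == " ":
            ok = ch == " "
        elif t == "0":
            ok = ch in string.digits
        elif t == "O":
            ok = ch == " " or ch in string.digits
        elif t == ":":
            ok = ch == ":"
        elif t == "+":
            ok = ch in "+-"
        else:
            ok = False  # unknown template character
        if not ok:
            return False
    return True

CTIME = "Aaa Aaa O0 00:00:00 0000"
print(cmatch("Tue Mar  7 07:49:05 2023", CTIME))
```

The point of the design is that the template reads like the date it
accepts, which is hard to say of the equivalent regular expression.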

  [1] https://github.com/robohack/ucb-csrg-bsd.git

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From lm at mcvoy.com  Tue Mar  7 11:43:11 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Mon, 6 Mar 2023 17:43:11 -0800
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
Message-ID: <20230307014311.GN5398@mcvoy.com>

On Mon, Mar 06, 2023 at 04:01:51PM -0500, Dan Cross wrote:
> On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2 at gmail.com> wrote:
> >[snip]
> > I would like to extend my program to
> > any date format. That would require
> > a much bigger RE. I have been led to
> > believe that a 50Kbyte or 500Kbyte
> > RE works just as well (if not
> > as fast) as a 100 byte RE. I think
> > with parentheses and
> > pipe-symbols suitably used,
> > one could match
> >
> >   Monday, March 6, 2023
> >   2023-03-06
> >   Mar 6, 2023
> >   or
> >   ...
> 
> This reminds me of something that I wanted to bring up.
> 
> Perhaps one _could_ define a sufficiently rich regular expression that
> one could match a number of date formats. However, I submit that one
> _should not_. REs may be sufficiently powerful, but in all likelihood
> what you'll end up with is an unreadable mess; it's like people who
> abuse `sed` or whatever to execute complex, general purpose programs:
> yeah, it's a clever hack, but that doesn't mean you should do it.

Dan, I agree with you.  I ran a software company for almost 20 years
and the main thing I contributed was "let's be dumb".  Let's write code
that is easy to read, easy to bug fix.

Smart engineers love to be clever; they would be the folks who wrote
those long REs that worked magic.  But that magic was something they
understood and nobody else did.

Less is more.  Less is easy to support.

From egbegb2 at gmail.com  Tue Mar  7 14:01:14 2023
From: egbegb2 at gmail.com (Ed Bradford)
Date: Mon, 6 Mar 2023 22:01:14 -0600
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230307014311.GN5398@mcvoy.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
 <20230307014311.GN5398@mcvoy.com>
Message-ID: <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>

I have made an attempt to make my RE stuff readable and supportable. I
think I write more description than I do RE "code". As for *it won't be
comprehensible*: machine language was unreadable, and then along came
assembly language. Assembly language was unreadable, then came higher
level languages. Even higher level languages are unsupportable if not
well documented and mostly simple to understand ("you are not expected
to understand this" notwithstanding). The jump from machine language to
Python today was unimagined in early times.

    [
     As an old timer, I see inflection points
     between:

       machine language and assembly language
       assembly language and high level languages
       and
       high level languages and python.

      But that's just me.
     ]



I think it is possible to make a 50K RE that is understandable. However, it
requires a lot of 'splainin' throughout the code. I'm naive though; I will
eventually discover a lack of truth in that belief, if such exists.

I repeat: I put stuff down for months at a time. My metric is *coming back
to it and understanding where I left off*. So far, I can do that for this RE
program that works for small files, large files, binary files and text files
for exactly one pattern:

    YYYY[-MM-DD]

I constructed this RE with code like this:

# ymdt is YYYY-MM-DD RE in text.

# looking only for 1900s and 2000s years and no later than today.
_YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"

# months
_MM   = "(0[1-9]|1[012])"

# days
_DD   = "(0[1-9]|[12]\d|3[01])"

ymdt = _YYYY + '[' + _INTERNALSEP +
                     _MM          +
                     _INTERNALSEP +
               ']'{0,1)

For the whole-file RE, I used

ymdthf = _FRSEP + ymdt + _BASEP

where FRSEP is the front separator, which includes a bunch of possible
separators, excluding numbers and letters, or-ed with the up arrow
"beginning of line" RE mark. BASEP, the back separator, is the same as
FRSEP with "^" replaced by "$".

I then aimed ymdthf at "data", the thing that represents
the entire memory mapped file (where there is only one beginning
and one end).

Again, I say validating an RE is as difficult as writing one, or more so.
What does it miss?

Dates are an excellent test ground for RE's. Latitude and longitude are
another.

Ed

PS: I thought I was on the COFF mailing list. I received this email
by direct mail from Larry. I haven't seen any other comments
on my submission. I might have unsubscribed, but now I regret it. Dear
powers that be: Please resubscribe me.

From egbegb2 at gmail.com  Tue Mar  7 14:19:42 2023
From: egbegb2 at gmail.com (Ed Bradford)
Date: Mon, 6 Mar 2023 22:19:42 -0600
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
Message-ID: <CAHTagfH=1tUVKgS8HQ8Y03j8MBLQNnw0OQDV3nE31rKc3HoJNQ@mail.gmail.com>

Hi Dan,

It sounds to me like an "optimizer" is needed. There is already a compiler
that uses FA's. Is someone else going to create a program
to look for dates without using regular expressions?

Today, I write small-sized RE's. If I write a giant RE, there is nothing
preventing the owner of RE world from changing how they are used. For
instance: compile your RE, and a subroutine/function is produced that
performs the RE search.

RE is a *language*, not necessarily an implementation.
At least that is my understanding.


Ed


On Mon, Mar 6, 2023 at 3:02 PM Dan Cross <crossd at gmail.com> wrote:

> On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2 at gmail.com> wrote:
> >[snip]
> > I would like to extend my program to
> > any date format. That would require
> > a much bigger RE. I have been led to
> > believe that a 50Kbyte or 500Kbyte
> > RE works just as well (if not
> > as fast) as a 100 byte RE. I think
> > with parentheses and
> > pipe-symbols suitably used,
> > one could match
> >
> >   Monday, March 6, 2023
> >   2023-03-06
> >   Mar 6, 2023
> >   or
> >   ...
>
> This reminds me of something that I wanted to bring up.
>
> Perhaps one _could_ define a sufficiently rich regular expression that
> one could match a number of date formats. However, I submit that one
> _should not_. REs may be sufficiently powerful, but in all likelihood
> what you'll end up with is an unreadable mess; it's like people who
> abuse `sed` or whatever to execute complex, general purpose programs:
> yeah, it's a clever hack, but that doesn't mean you should do it.
>
> Pick the right tool for the job. REs are a powerful tool, but they're
> not the right tool for _every_ job, and I'd argue that once you hit a
> threshold of complexity that'll be mostly self-evident, it's time to
> move on to something else.
>
> As for large vs small REs.... When we start talking about differences
> of orders of magnitude in size, we start talking about real
> performance implications; in general an NDFA simulation of a regular
> expression will have on the order of the length of the RE in states,
> so when the length of the RE is half a million symbols, that's
> half-a-million states, which, even though it's bounded, is practically
> speaking still a pretty big number, even on modern CPUs.
>
> I wouldn't want to poke that bear.
>
>         - Dan C.
>


-- 
Advice is judged by results, not by intentions.
  Cicero

From ralph at inputplus.co.uk  Tue Mar  7 21:39:49 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Tue, 07 Mar 2023 11:39:49 +0000
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
 <20230307014311.GN5398@mcvoy.com>
 <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
Message-ID: <20230307113949.501602135B@orac.inputplus.co.uk>

Hi Ed,

> I have made an attempt to make my RE stuff readable and supportable.

Readable to you, which is fine because you're the prime future reader.
But it's less readable than the regexp to those that know and read them
because of the indirection introduced by the variables.  You've created
your own little language of CAPITALS rather than the lingua franca of
regexps.  :-)

> Machine language was unreadable and then along came assembly language.
> Assembly language was unreadable, then came higher level languages.

Each time the original language was readable because practitioners had
to read and write it.  When its replacement came along, the old skill
was no longer learnt and the language became ‘unreadable’.

> So far, I can do that for this RE program that works for small files,
> large files, binary files and text files for exactly one pattern:
>     YYYY[-MM-DD]
> I constructed this RE with code like this:
>     # ymdt is YYYY-MM-DD RE in text.
>     # looking only for 1900s and 2000s years and no later than today.
>     _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"

‘{1}’ is redundant.

>     # months
>     _MM   = "(0[1-9]|1[012])"
>     # days
>     _DD   = "(0[1-9]|[12]\d|3[01])"
>     ymdt = _YYYY + '[' + _INTERNALSEP +
>                          _MM          +
>                          _INTERNALSEP +
>                    ']'{0,1)

I think we're missing something as the ‘'['’ is starting a character
class which is odd for wrapping the month and the ‘{0,1)’ doesn't have
matching brackets and is outside the string.

BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
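
One guess at what the quoted construction was aiming for, as a sketch
only.  The separator set stands in for _INTERNALSEP, which the post
does not show, and year_re() and ymd_re() are hypothetical helper
names.  The optional part is a group followed by ‘?’, not a character
class, and the day is included.

```python
import re
from datetime import date

# Assumption: the separator set is a guess; _INTERNALSEP is not shown above.
INTERNAL_SEP = r"[-/.]"
MM = r"(?:0[1-9]|1[012])"
DD = r"(?:0[1-9]|[12]\d|3[01])"

def year_re(today):
    """Years 1900 up to today's year; only valid while that year is 2020-2029."""
    last_digit = today.year % 10
    return r"(?:19\d\d|20[01]\d|202[0-%d])" % last_digit

def ymd_re(today):
    """YYYY, optionally followed by separator, MM, separator, DD."""
    return year_re(today) + "(?:" + INTERNAL_SEP + MM + INTERNAL_SEP + DD + ")?"

# A fixed date keeps the example reproducible.
ymdt = re.compile(ymd_re(date(2023, 3, 7)))
print(bool(ymdt.fullmatch("2023-03-06")), bool(ymdt.fullmatch("2024-01-01")))
```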

> For the whole file, RE I used
>     ymdthf = _FRSEP + ymdt + _BASEP
> where FRSEP is front separator which includes
> a bunch of possible separators, excluding numbers and letters, or-ed
> with the up arrow "beginning of line" RE mark.

It sounds like you're wanting a word boundary; something provided by
regexps.  In Python, it's ‘\b’.

    >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
    (<re.Match object; span=(16, 19), match='foo'>,)

Are you aware of the /x modifier to a regexp which ignores internal
whitespace, including linefeeds?  This allows a large regexp to be split
over lines.  There's a comment syntax too.  See
https://docs.python.org/3/library/re.html#re.X
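
As a rough illustration of re.X, here is a date regexp laid out over
several lines with comments.  The year range and the separator set
are assumptions made for the example, and Python's \b treats ‘_’ as
a word character, which matters if ‘_’ is used as a separator.

```python
import re

DATE = re.compile(r"""
    \b
    (19[0-9][0-9] | 20[01][0-9] | 202[0-3])   # year, 1900 to 2023
    (?:
        ([-:._])                              # separator, kept as group 2
        (0[1-9] | 1[0-2])                     # month
        \2                                    # the same separator again
        (0[1-9] | [12][0-9] | 3[01])          # day
    )?                                        # month and day are optional
    \b
""", re.VERBOSE)

text = "2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01"
found = [m.group(0) for m in DATE.finditer(text)]
print(found)
```

The backreference is what rejects mixed separators such as
‘2000-01.01’, leaving only the bare year to match.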

GNU grep isn't too shabby at looking through binary files.  I can't use
/x with grep so in a bash script, I'd do it manually.  \< and \> match
the start and end of a word, a bit like Python's \b.

    re='
        .?\<
            (19[0-9][0-9]|20[01][0-9]|202[0-3])
            (
                ([-:._])
                (0[1-9]|1[0-2])
                \3
                (0[1-9]|[12][0-9]|3[01])
            )?
        \>.?
    '
    re=${re//$'\n'/}
    re=${re// /}

    printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file
    LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l

which gives

    0:2001-04-01,$
    11:1999_12_31$
    22:1944.03.01,$
    33:1914!$
    39:2000-$

showing:

- the byte offset within the file of each match,
- along with any byte before and after it, if that byte is not a \n and
  not already matched, just to show the word-boundary at work,
- with any non-printables escaped into octal by sed.

> I thought I was on the COFF mailing list.

I'm sending this to just the list.

> I received this email by direct mail to from Larry.

Perhaps your account on the list is configured to not send you an email
if it sees your address in the header's fields.

-- 
Cheers, Ralph.

From crossd at gmail.com  Wed Mar  8 02:14:49 2023
From: crossd at gmail.com (Dan Cross)
Date: Tue, 7 Mar 2023 11:14:49 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
 <20230307014311.GN5398@mcvoy.com>
 <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
Message-ID: <CAEoi9W6sLfpoGo8pGF_1RYA62Pi8aCWdyo53vNKqdtxyFKTGpA@mail.gmail.com>

On Mon, Mar 6, 2023 at 11:01 PM Ed Bradford <egbegb2 at gmail.com> wrote:
>[snip]
> I think it is possible to make a 50K RE that is understandable. However, it requires
> a lot of 'splainin' throughout the code. I'm naive though; I will eventually discover
> a lack of truth in that belief, if such exists.

Actually, I believe you. I'm sure that with enough effort, it _is_
possible to make a 50K RE that is understandable to mere mortals. But
it begs the question: why bother? The answer to that question, in my
mind, shows the difference between a clever programmer and a pragmatic
engineer. I submit that it's time to reach for another tool well
before you get to an RE that big, and if one is still considering such
a thing, one must really ask what properties of REs and the problem at
hand one thinks lend themselves to that as the solution.

>[snip]
> It sounds to me like an "optimizer" is needed. There is already a compiler
> that uses FA's.

I'm not sure what you're referring to here, though you were replying
to me. There are a couple of different threads floating around:

1. Writing really big regular expressions: this is probably a bad
idea. Don't do it (see below).
2. Writing a recognizer for dates. Yeah, the small REs you have for
that are fine. If you want to extend those to arbitrary date formats,
I think you'll find it starts getting ugly.
3. Optimizing regular expressions. You're still bound by the known
theoretical properties of finite automata here.

> Is someone else going to create a program
> to look for dates without using regular expressions?

Many people have already done so. :-)

> Today, I write small-sized RE's. If I write a giant RE, there is nothing preventing
> the owner of RE world to change how they are used. For instance. Compile your RE
> and a subroutine/function is produced that performs the RE search.

I'm not sure I understand what you mean.

The theory here is well-understood: we know recognizers for regular
languages can be built from DFAs, that run in time linear in the size
of their inputs, but we also know that constructing such a DFA can be
exponential in space and time, and thus impractical for many REs.

We know that NDFA simulators can be built in time and space linear in
the length of the RE, but that the resulting recognizers will be
superlinear at runtime, proportional to the product of the length of
input, number of states, and number of edges between states in the state
transition graph. For a very large regular expression, that's going to
be a pretty big number, and even on modern CPUs won't be particularly
fast. Compilation to native code won't really help you.
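
A small sketch of both halves of that trade-off, using the textbook
language "the n-th symbol from the end is an a", i.e.
(a|b)*a(a|b){n-1}.  The function names are made up for illustration.
The NFA has n+1 states and is simulated by carrying a set of states
across the input; the subset construction for the same NFA reaches
2^n DFA states.

```python
def nth_from_last_nfa(n):
    """NFA for (a|b)*a(a|b){n-1}, with states 0..n and n accepting."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, 0, n  # transitions, start state, accepting state

def step(delta, states, symbol):
    """All states reachable from 'states' on one input symbol."""
    return frozenset(t for s in states for t in delta.get((s, symbol), ()))

def nfa_accepts(nfa, text):
    """Simulate the NFA: per-symbol work grows with the number of states."""
    delta, start, accept = nfa
    states = frozenset({start})
    for symbol in text:
        states = step(delta, states, symbol)
    return accept in states

def dfa_state_count(nfa, alphabet="ab"):
    """Count the DFA states the subset construction would create."""
    delta, start, _ = nfa
    seen = {frozenset({start})}
    todo = list(seen)
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            nxt = step(delta, current, symbol)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

print(nfa_accepts(nth_from_last_nfa(3), "abb"))
print(dfa_state_count(nth_from_last_nfa(10)))
```

Eleven NFA states turn into 1024 DFA states for n = 10, and the count
doubles with each further increment of n.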

There is no "owner of RE world" that can change that. If you can find
some way to do so, I think that would qualify as a major breakthrough
in computer science.

> RE is a language, not necessarily an implementation.
> At least that is my understanding.

Regular expressions describe regular languages, but as I mentioned
above, the theory gives the currently understood bounds for their
performance characteristics. It's kinda like the speed of light in
this regard; we can't really make it go faster.

        - Dan C.

From tytso at mit.edu  Wed Mar  8 02:42:14 2023
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 7 Mar 2023 11:42:14 -0500
Subject: [COFF] [TUHS] Re: Origins of the frame buffer device
In-Reply-To: <20230306232429.GL5398@mcvoy.com>
References: <8BD57BAB138946830AF560E17376A63B.for-standards-violators@oclsc.org>
 <20230306232429.GL5398@mcvoy.com>
Message-ID: <20230307164214.GC960946@mit.edu>

(Moving to COFF)

On Mon, Mar 06, 2023 at 03:24:29PM -0800, Larry McVoy wrote:
> But even that seems suspect, I would think they could put some logic
> in there that just doesn't feed power to the GPU if you aren't using
> it but maybe that's harder than I think.
> 
> If it's not about power then I don't get it, there are tons of transistors
> waiting to be used, they could easily plunk down a bunch of GPUs on the
> same die so why not?  Maybe the dev timelines are completely different
> (I suspect not, I'm just grabbing at straws).

Other potential reasons:

1) Moving functionality off-CPU also allows for those devices to have
their own specialized video memory that might be faster (SDRAM) or
dual-ported (VRAM) without having to add that complexity to the more
general system DRAM and/or the CPU's Northbridge.

2) In some cases, an off-chip co-processor may not need any
access to the system memory at all.  An example of this is the "bump
in the wire" in-line crypto engine (ICE), which is located between the
Southbridge and the eMMC/UFS flash storage device.  If you are using an
Android device, it's likely to have an ICE.  The big advantage is that
it avoids needing to have a bounce buffer on the write path, where the
file system encryption layer has to copy-and-encrypt data from the
page cache to a bounce buffer, and the encrypted block then
gets DMA'ed to the storage device.

3) From an architectural perspective, not all use cases need various
co-processors, whether for doing cryptography, or running some
kind of machine-learning module, or image manipulation to simulate
bokeh, or create HDR images, etc.  While RISC-V does have the concept
of instruction set extensions, which can be developed without getting
permission from the "owners" of the core CPU ISA (e.g., ARM, Intel,
etc.), it's a lot more convenient for someone who doesn't need to bend
the knee to ARM, Inc. (or their new corporate overlords) or Intel, to
simply put that extension outside the core ISA.

(More recently, there is an interesting lawsuit about whether it's
"allowed" to put a 3rd party co-processor on the same SOC without
paying $$$$$ to the corporate overlord, which may make this point moot
--- although it might cause people to simply switch to another ISA
that doesn't have this kind of lawsuit-happy rent-seeking....)

In any case, if you don't need to play Quake with 240 frames per
second, then there's no point putting the GPU in the core CPU
architecture, and it may turn out that the kind of co-processor which
is optimized for running ML models is different, and it is often
easier to make changes to the programming model for a GPU, compared to
making changes to a CPU's ISA.

						- Ted

From ralph at inputplus.co.uk  Wed Mar  8 03:34:51 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Tue, 07 Mar 2023 17:34:51 +0000
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
In-Reply-To: <CAEoi9W6sLfpoGo8pGF_1RYA62Pi8aCWdyo53vNKqdtxyFKTGpA@mail.gmail.com>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
 <20230307014311.GN5398@mcvoy.com>
 <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
 <CAEoi9W6sLfpoGo8pGF_1RYA62Pi8aCWdyo53vNKqdtxyFKTGpA@mail.gmail.com>
Message-ID: <20230307173451.D94B421C9B@orac.inputplus.co.uk>

Hi Dan,

> I'm sure that with enough effort, it _is_ possible to make a 50K RE
> that is understandable to mere mortals.  But it begs the question: why
> bother?

It could be the quickest way to express the intent.

> The answer to that question, in my mind, shows the difference between
> a clever programmer and a pragmatic engineer.

I think those two can overlap.  :-)

> I submit that it's time to reach for another tool well before you get
> to an RE that big

Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular
grammar?  I think sticking with code aimed at regular grammars, or more
likely regexps, will do better than, say, a parser generator for a
type-two context-free grammar.

As well as the lex(1) family, there's Ragel as another example.
http://www.colm.net/open-source/ragel/

> 3.  Optimizing regular expressions.  You're still bound by the known
> theoretical properties of finite automata here.

Well, we're back to the RE v. regexp again.  /^[0-9]+\.jpeg$/ is matched
by some engines by first checking the last five bytes are ‘.jpeg’.

    $ debugperl -Dr -e \
    >     '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
    ...
    Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg"
    Intuit: trying to determine minimum start position...
      doing 'check' fbm scan, [1..46] gave -1
      Did not find floating substr ".jpeg"$...
    Match rejected by optimizer
    Freeing REx: "^[0-9]+\.jpeg$"
    $

Boyer-Moore string searching can be used.  Common-subregexp-elimination
can spot repetitive fragments of a regexp and factor them into a single
set of states along with pairing the route into them with the
appropriate route out.
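
The idea of rejecting on a required literal before running the
automaton can be sketched by hand.  This only illustrates the
principle, not what Perl's optimizer does internally, and
is_numbered_jpeg is a made-up name.  Unlike Perl's ‘$’, fullmatch()
does not tolerate a trailing newline.

```python
import re

NUMBERED_JPEG = re.compile(r"[0-9]+\.jpeg")

def is_numbered_jpeg(name):
    """Match ^[0-9]+\\.jpeg$ with a cheap literal check up front."""
    # Any match must end in the literal ".jpeg", so test that first
    # and skip the regexp engine entirely when it is absent.
    if not name.endswith(".jpeg"):
        return False
    return NUMBERED_JPEG.fullmatch(name) is not None

print(is_numbered_jpeg("123546789012354678901235467890123546789012.jpg"))
print(is_numbered_jpeg("42.jpeg"))
```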

The more regexp engines are optimised, the more benefit to the
programmer from sticking to a regexp rather than, say, ad hoc parsing.

The theory of REs is interesting and important, but regexps deviate from
it ever more.

-- 
Cheers, Ralph.

From coff at tuhs.org  Wed Mar  8 04:31:55 2023
From: coff at tuhs.org (Grant Taylor via COFF)
Date: Tue, 7 Mar 2023 11:31:55 -0700
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230307113949.501602135B@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
 <20230307014311.GN5398@mcvoy.com>
 <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
 <20230307113949.501602135B@orac.inputplus.co.uk>
Message-ID: <ef8945e4-c25c-eed5-2480-78f18d9bc75a@spamtrap.tnetconsulting.net>

On 3/7/23 4:39 AM, Ralph Corderoy wrote:
> Readable to you, which is fine because you're the prime future 
> reader.  But it's less readable than the regexp to those that know 
> and read them because of the indirection introduced by the variables. 
> You've created your own little language of CAPITALS rather than the 
> lingua franca of regexps.  :-)

I want to agree, but then I run into things like this:

    ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ 
postfix(/smtps)?/smtpd\[[[:digit:]]+\]: disconnect from 
[._[:alnum:]-]+\[[.:[:xdigit:]]+\]( helo=[[:digit:]]+(/[[:digit:]]+)?)?( 
ehlo=[[:digit:]]+(/[[:digit:]]+)?)?( 
starttls=[[:digit:]]+(/[[:digit:]]+)?)?( 
auth=[[:digit:]]+(/[[:digit:]]+)?)?( 
mail=[[:digit:]]+(/[[:digit:]]+)?)?( 
rcpt=[[:digit:]]+(/[[:digit:]]+)?)?( 
data=[[:digit:]]+(/[[:digit:]]+)?)?( 
bdat=[[:digit:]]+(/[[:digit:]]+)?)?( 
rset=[[:digit:]]+(/[[:digit:]]+)?)?( 
noop=[[:digit:]]+(/[[:digit:]]+)?)?( 
quit=[[:digit:]]+(/[[:digit:]]+)?)?( 
unknown=[[:digit:]]+(/[[:digit:]]+)?)?( 
commands=[[:digit:]]+(/[[:digit:]]+)?)?$

Which is produced by this m4:

    define(`DAEMONPID', `$1\[DIGITS\]:')dnl
    define(`DATE', `\w{3} [ :[:digit:]]{11}')dnl
    define(`DIGIT', `[[:digit:]]')dnl
    define(`DIGITS', `DIGIT+')dnl
    define(`HOST', `[._[:alnum:]-]+')dnl
    define(`HOSTIP', `HOST\[IP\]')dnl
    define(`IP', `[.:[:xdigit:]]+')dnl
    define(`VERB', `( $1=DIGITS`'(/DIGITS)?)?')dnl
    ^DATE HOST DAEMONPID(`postfix(/smtps)?/smtpd') disconnect from 
 
HOSTIP`'VERB(`helo')VERB(`ehlo')VERB(`starttls')VERB(`auth')VERB(`mail')VERB(`rcpt')VERB(`data')VERB(`bdat')VERB(`rset')VERB(`noop')VERB(`quit')VERB(`unknown')VERB(`commands')$

I only consider myself to be an /adequate/ m4 user.  Though I've done 
some things that are arguably creating new languages.

I personally find the generated regular expression to be onerous to read 
and understand, much less modify.  I would be highly dependent on my 
editor's (vim's) parenthesis / square bracket matching (%) capability 
and / or would need to explode the RE into multiple components on 
multiple lines to have a hope of accurately understanding or modifying it.

Conversely I think that the m4 is /largely/ find and replace with a 
little syntactic sugar around the definitions.

I also think that anyone who understands regular expressions and 
the concept of find & replace is likely to be able to recognize the 
patterns -- as in "VERB(...)" corresponds to "( 
$1=DIGITS`'(/DIGITS)?)?", "DIGITS" corresponds to "DIGIT+", and 
"DIGIT" corresponds to "[[:digit:]]".

There seems to be a point between simple REs w/o any supporting 
constructor and complex REs with supporting constructor where I think it 
is better to have the constructors.  Especially when duplication comes 
into play.

If nothing else, the constructors are likely to reduce one-off typo 
errors.  The typo will either be everywhere the constructor was used, or 
similarly be fixed everywhere at the same time.  Conversely, finding an 
unmatched parenthesis or square bracket in the RE above will be annoying 
at best, and more likely daunting.
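
For comparison, the same construction sketched with ordinary string
building instead of m4.  The names mirror the m4 macros; the VERBS
list and the helper functions are just one way to lay it out.  Note
that Python's own re module does not understand POSIX classes such as
[[:digit:]], so the resulting string is meant to be handed to a POSIX
tool like grep -E, not to re.compile().

```python
DIGIT = "[[:digit:]]"
DIGITS = DIGIT + "+"
DATE = r"\w{3} [ :[:digit:]]{11}"
HOST = "[._[:alnum:]-]+"
IP = "[.:[:xdigit:]]+"
HOSTIP = HOST + r"\[" + IP + r"\]"

def daemonpid(name):
    """Daemon name followed by its bracketed PID and a colon."""
    return name + r"\[" + DIGITS + r"\]:"

def verb(name):
    """Optional ' name=N' or ' name=N/M' counter."""
    return "( " + name + "=" + DIGITS + "(/" + DIGITS + ")?)?"

VERBS = ["helo", "ehlo", "starttls", "auth", "mail", "rcpt", "data",
         "bdat", "rset", "noop", "quit", "unknown", "commands"]

pattern = ("^" + DATE + " " + HOST + " "
           + daemonpid("postfix(/smtps)?/smtpd")
           + " disconnect from " + HOSTIP
           + "".join(verb(v) for v in VERBS) + "$")
print(pattern)
```

Adding a new counter is then a one-word change to VERBS rather than
another hand-typed bracket expression.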

> Each time the original language was readable because practitioners 
> had to read and write it.  When its replacement came along, the old 
> skill was no longer learnt and the language became ‘unreadable’.

I feel like there is an analogy between machine code and assembly 
language as well as assembly language and higher level languages.

My understanding is that the computer industry has largely agreed that 
the higher level language is easier to understand and maintain.

> ‘{1}’ is redundant.

That may very well be.  But what will be more maintainable / easier to 
correct in the future; adding `{2}` when necessary or changing the value 
of `1` to `2`?

I think this is an example of a tradeoff: something not strictly 
required that makes things more maintainable down the road.  Sort of 
like fleet vehicles vs non-fleet vehicles.

> BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.

I think this is another example of the maintainability.

> I'm sending this to just the list.

I'm also replying to only the COFF mailing list.

> Perhaps your account on the list is configured to not send you an 
> email if it sees your address in the header's fields.

There is a reasonable chance that the COFF mailing list and / or your 
account therein is configured to minimize duplicates meaning the COFF 
mailing list won't send you a copy if it sees your subscribed address as 
receiving a copy directly.

I personally always prefer the mailing list copy and shun the direct 
copies.  I think that the copy from the mailing list keeps the 
discussion on the mailing list and avoids accidental replies bypassing 
the mailing list.



-- 
Grant. . . .
unix || die


From crossd at gmail.com  Wed Mar  8 04:33:00 2023
From: crossd at gmail.com (Dan Cross)
Date: Tue, 7 Mar 2023 13:33:00 -0500
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230307173451.D94B421C9B@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
 <20230307014311.GN5398@mcvoy.com>
 <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
 <CAEoi9W6sLfpoGo8pGF_1RYA62Pi8aCWdyo53vNKqdtxyFKTGpA@mail.gmail.com>
 <20230307173451.D94B421C9B@orac.inputplus.co.uk>
Message-ID: <CAEoi9W6cKJodYLwccCaD9_byceKosqFJ38p1KiSpZtareJ6nQw@mail.gmail.com>

On Tue, Mar 7, 2023 at 12:34 PM Ralph Corderoy <ralph at inputplus.co.uk> wrote:
> > I'm sure that with enough effort, it _is_ possible to make a 50K RE
> > that is understandable to mere mortals.  But it begs the question: why
> > bother?
>
> It could be the quickest way to express the intent.

Ok, I challenge you to find me anything for which the quickest way to
express the intent is a 50 *thousand* symbol regular expression. :-)

> > The answer to that question, in my mind, shows the difference between
> > a clever programmer and a pragmatic engineer.
>
> I think those two can overlap.  :-)

Indeed they can. But I have grave doubts that this is a good example
of such overlap.

> > I submit that it's time to reach for another tool well before you get
> > to an RE that big
>
> Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular
> grammar?  I think sticking with code aimed at regular grammars, or more
> likely regexps, will do better than, say, a parser generator for a
> type-two context-free grammar.

Is there an extant, non-theoretical, example of such a grammar?

> As well as the lex(1) family, there's Ragel as another example.
> http://www.colm.net/open-source/ragel/

This is moving the goal posts more than a bit. I'm suggesting that a
50k-symbol RE is unlikely to be the best solution to any reasonable
problem. A state-machine generator, even one with 50k statements, is
not a 50k RE.

> > 3.  Optimizing regular expressions.  You're still bound by the known
> > theoretical properties of finite automata here.
>
> Well, we're back to the RE v. regexp again.  /^[0-9]+\.jpeg$/ is matched
> by some engines by first checking the last five bytes are ‘.jpeg’.

...in general, in order to find the end, won't _something_ have to
traverse the entire input?

(Note that I said, "in general". Allusions to mmap'ed files or seeking
to the end of a file are not general, since they don't apply well to
important categories of input sources, such as pipes or network
connections.)
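
A minimal sketch of that point, assuming a forward-only byte stream such as a pipe or socket. The function name and chunk size are illustrative and not taken from any regexp engine. The only way to learn whether such an input ends in ".jpeg" is to read all of it, so a suffix check can save matching work but not the traversal itself.

```python
import io


def stream_ends_with(stream, suffix, chunk_size=4096):
    """Return (ends_with_suffix, bytes_read) for a forward-only byte stream.

    Assumes a non-empty suffix. Only the last len(suffix) bytes are kept,
    but every byte still has to be read to find out where the end is.
    """
    tail = b""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        tail = (tail + chunk)[-len(suffix):]
    return tail == suffix, total


# io.BytesIO stands in for a pipe here; all 9 bytes are consumed.
print(stream_ends_with(io.BytesIO(b"1234.jpeg"), b".jpeg", chunk_size=4))
# prints (True, 9)
```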

>     $ debugperl -Dr -e \
>     >     '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
>     ...
>     Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg"
>     Intuit: trying to determine minimum start position...
>       doing 'check' fbm scan, [1..46] gave -1
>       Did not find floating substr ".jpeg"$...
>     Match rejected by optimizer
>     Freeing REx: "^[0-9]+\.jpeg$"
>     $
>
> Boyer-Moore string searching can be used.  Common-subregexp-elimination
> can spot repetitive fragment of regexp and factor them into a single set
> of states along with pairing the route into them with the appropriate
> route out.

Well, that's what big-O notation accounts for.

I'm afraid none of this really changes the time bounds, however, when
applied in general.

> The more regexp engines are optimised, the more benefit to the
> programmer from sticking to a regexp rather than, say, ad hoc parsing.

This is comparing apples and oranges. There may be all sorts of
heuristics that we can apply to specific regular expressions to prune
the search space, and that's great. But by their very nature,
heuristics are not always generally applicable.

As an analogy, we know that we cannot solve _the_ halting problem, but
we also know that we can solve _many_ halting problem_s_. For example,
a compiler can recognize that any of `for(;;);`, `while(1);`, or
`loop {}` does not halt, and so on, ad nauseam. But even if some oracle
can recognize arbitrarily many such halting problems, we still haven't
solved the general problem.

> The theory of REs is interesting and important, but regexps deviate from
> it ever more.

Yup. My post should not be construed as suggesting that regexps are
not useful, or that they should not be a part of a programmer's
toolkit. My post _should_ be construed as a suggestion that they are
not always the best solution, and a part of being an engineer is
finding that dividing line.

        - Dan C.

From rtomek at ceti.pl  Wed Mar  8 07:49:14 2023
From: rtomek at ceti.pl (Tomasz Rola)
Date: Tue, 7 Mar 2023 22:49:14 +0100
Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on
 extended regular expressions in grep.)
In-Reply-To: <20230304101533.D9CCF2021A@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <CAEoi9W7D49pdoKXMruSLUp7QO7468FkmwL2E715Nu=DfSHWhmQ@mail.gmail.com>
 <20230303134215.3ED63215AA@orac.inputplus.co.uk>
 <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net>
 <20230304101533.D9CCF2021A@orac.inputplus.co.uk>
Message-ID: <ZAexWnykYAYoqqtz@tau1.ceti.pl>

On Sat, Mar 04, 2023 at 10:15:33AM +0000, Ralph Corderoy wrote:
> Hi,
> 
> Grant wrote:
> > Even inventorying and keeping track of the books can be time
> > consuming.  --  Thankfully I took some time to do exactly that and
> > have access to that information on the super computer in my pocket.
> 
> I seek recommendations for an Android app to comfortably read PDFs on a
> mobile phone's screen.  They were intended to be printed as a book.  In
> particular, once I've zoomed and panned to get the interesting part of a
> page as large as possible, swiping between pages should persist that
> view.  An extra point for allowing odd and even pages to use different
> panning.

My own recommendation for this is to get a dedicated ebook reader. It
will feel a bit clumsy to have both a cretinphone and another thing
with you, but at least the thing is doing the job. At least, mine
keeps cropping across pages. Also, the e-ink/epaper display of an ebook
reader is not supposed to screw your eyes and/or circadian rhythms
(not that I know anything specific, but I find it very strange that
people shine blue light into their eyes for extended periods of time
and do not even quietly protest - well, perhaps it is akin to what
goes between human and a dog, they become alike to each other, now,
when a human has cretinphone...).

Or, if it is just one PDF to read, then you should be fine reading it on
a bigger screen.

HTH

-- 
Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature.      **
** As the answer, master did "rm -rif" on the programmer's home    **
** directory. And then the C programmer became enlightened...      **
**                                                                 **
** Tomasz Rola          mailto:tomasz_rola at bigfoot.com             **

From rtomek at ceti.pl  Wed Mar  8 08:46:04 2023
From: rtomek at ceti.pl (Tomasz Rola)
Date: Tue, 7 Mar 2023 23:46:04 +0100
Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on
 extended regular expressions in grep.)
In-Reply-To: <ZAexWnykYAYoqqtz@tau1.ceti.pl>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <20230303105928.E88AB215AA@orac.inputplus.co.uk>
 <CAEoi9W7D49pdoKXMruSLUp7QO7468FkmwL2E715Nu=DfSHWhmQ@mail.gmail.com>
 <20230303134215.3ED63215AA@orac.inputplus.co.uk>
 <21e8477c-c388-7b90-ed10-21c7f76f0892@spamtrap.tnetconsulting.net>
 <20230304101533.D9CCF2021A@orac.inputplus.co.uk>
 <ZAexWnykYAYoqqtz@tau1.ceti.pl>
Message-ID: <ZAe+rD30l3v/cNNT@tau1.ceti.pl>

On Tue, Mar 07, 2023 at 10:49:14PM +0100, Tomasz Rola wrote:
[...]
> people shine blue light into their eyes for extended periods of time
> and do not even quietly protest - well, perhaps it is akin to what
> goes between human and a dog, they become alike to each other, now,
> when a human has cretinphone...).

To answer the unasked question, I own a cretinphone too :-). And a few
dumbs. Together, they sum up to something like a cretinphone on
steroids.

-- 
Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature.      **
** As the answer, master did "rm -rif" on the programmer's home    **
** directory. And then the C programmer became enlightened...      **
**                                                                 **
** Tomasz Rola          mailto:tomasz_rola at bigfoot.com             **

From egbegb2 at gmail.com  Wed Mar  8 21:22:56 2023
From: egbegb2 at gmail.com (Ed Bradford)
Date: Wed, 8 Mar 2023 05:22:56 -0600
Subject: [COFF] Requesting thoughts on extended regular expressions in
 grep.
In-Reply-To: <20230307113949.501602135B@orac.inputplus.co.uk>
References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>
 <CAHTagfH97hXvW4=pMPYfQuJsVCtGtUvfTgDjUhw2kRe6FOUqTg@mail.gmail.com>
 <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>
 <20230307014311.GN5398@mcvoy.com>
 <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>
 <20230307113949.501602135B@orac.inputplus.co.uk>
Message-ID: <CAHTagfGYNi-TvkPMsXBf36a3g-b7D7qtk-xn9k6kiwu0YM7DcA@mail.gmail.com>

Thank you for the very useful comments. However,
I disagree with you about the RE language. While
I agree all RE experts don't need that, when I was
hiring and gave some software to a new hire (whether
an experienced programmer or a recent college grad)
simply handing over huge RE's to my new hire was
a daunting task to that person. I wrote that stuff
that way to help remind me and anyone who might
use the python program.

I don't claim success. It does help me.

When you say '{1}' is redundant, I think I did that
to avoid any possibility of conflicts with the
next string that is
concatenated to the *Y_* (e.g. '*' or '+' or '{4,7}').
I am embarrassed I did not communicate that
in the code. I had to think about it for a couple of hours
before I recalled the "why". I will fix that.

  (it would be difficult to discuss
   this RE if I had to write
       "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE + "]" + ")"
   rather than just *Y_*).

My initial thoughts
on naming were I wanted the definition to be defined
in exactly one place in the software.
Python and the BTL folks told me to never
use a constant in code. Always name it.
Hence, I gave it a name. Each name might
be used in multiple places. They might be imported.

You are correct, the expression is unbalanced. I tried
to remove the text2bytes(lastYearRE) call so the expression in
this email was all text. I failed to remove the trailing ')' when
I removed the call to text2bytes(). My hasty transcriptions
might have produced similar errors in my email.
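
One way the named pieces discussed here might be assembled in Python, as a sketch only. The names LAST_YEAR_DIGIT, SEP and YMD are hypothetical, the separator set is borrowed from the grep example quoted further down, and str patterns are used for brevity (a bytes pattern would be needed for binary files).

```python
import re

# Assumption: last digit of the latest allowed 202x year (2023 here).
LAST_YEAR_DIGIT = "3"

# Each piece is defined exactly once and named, then reused below.
YYYY = r"(?:19\d\d|20[01]\d|202[0-" + LAST_YEAR_DIGIT + r"])"
MM = r"(?:0[1-9]|1[012])"
DD = r"(?:0[1-9]|[12]\d|3[01])"
SEP = r"[-:._]"

# YYYY alone, or YYYY<sep>MM<sep>DD with the same separator both times
# (\2 refers back to the separator group). The lookarounds stand in for
# the front and back separators: no letter or digit may touch the match.
YMD = re.compile(
    rf"(?<![0-9A-Za-z])({YYYY})(?:({SEP})({MM})\2({DD}))?(?![0-9A-Za-z])",
    re.ASCII,
)

text = "2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01"
print([m.group(0) for m in YMD.finditer(text)])
# prints ['2001-04-01', '1999_12_31', '1944.03.01', '1914', '2000']
```

No '*' or '+' appears in the pattern; the only repetition is the single optional month-and-day group.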

Recall, my focus was on any file of any size.
I'm on Windows 10 and an m1 MacBook.
Python works on both. I don't have
a Linux machine or enough desktop space to
host one. I'm also mildly fed-up with
virtual machines.

Friedl taught me one thing. Most
RE implementations are different. I'm trying
to write a program that I could give
to anyone and could reliably find a date (an RE) in
any file. YYYY, MM, DD, HR, MI, SE, TH are words
my user could use in the command line or in
an options dialog. LAT and LON might also be
possibilities. CST, EST, MST, PST, ... also.
A 500 gigabyte archive or directory/folder
of pictures and movies would be
a great test target.

I very much appreciate your comments. If this
discussion is boring to others, I would be happy
to take it to emails.

I like your program. My experience
with RE, grep, python, and sed suggests that
anything but gnu grep and sed might not work due to the
different implementations.

I've been out of the Unix software business
for 30 years after starting work at BTL in the 1970s
and working on Version 6. I didn't know "printf" was now
built into bash! That was a surprise. It's an incremental
improvement, but doesn't compare with f-strings in python.
*The interactive interpreter for python should have*
*a "bash" mode?!*

Does grep use a memory mapped file for its search, thereby
avoiding all buffering boundaries? That too, would
be new information to me. The additional complexity
of dealing with buffering is more than annoying.

Do you have any thoughts on how to verify
a program that uses RE's? I've given it no thought
until now. My first thought for dates would be
to write a separate module that simply searched
through the file looking for 4 digits in a row
without using RE's, recording the offsets and 16 characters
after and 1 character before in a Python list of (offset, str)
tuples, ddddList, and using ddddList
as a proxy for the entire file. I could then
aim my RE's at ddddList. [A list of tuples in Python
is wonderful!] It seems to me '*' and '+' and {x,y} are the performance
hogs in RE's. My RE's avoid them. One pass, I think, should
suffice. What do you think? I haven't "archived" my 350 GB
of pictures and movies, but one pass over all files therein
ought to suffice, right? Two different programs that use different
algorithms should be pretty good proof of correctness, wouldn't
you think?

My RE's have no stars or pluses. If there is a mismatch before
a match, give up and move on.
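
A rough sketch of the RE-free cross-check described above, assuming the file contents are already in memory as bytes. The function name is hypothetical, and the window sizes (1 byte before, 16 after) follow the description. It records every position where four ASCII digits appear in a row, so a run of five digits yields two overlapping candidates.

```python
def four_digit_candidates(data, before=1, after=16):
    """Return a list of (offset, context) tuples, one per 4-digit window.

    offset is where the four digits start; context is the slice holding
    `before` bytes ahead of them, the digits, and `after` bytes following.
    """
    found = []
    run = 0  # length of the current run of ASCII digits
    for i, byte in enumerate(data):
        if 48 <= byte <= 57:  # b"0" .. b"9"
            run += 1
        else:
            run = 0
        if run >= 4:
            start = i - 3
            context = data[max(0, start - before):start + 4 + after]
            found.append((start, context))
    return found


ddddList = four_digit_candidates(b"x2001-04-01 and 12345")
print([offset for offset, _ in ddddList])
# prints [1, 16, 17]
```

Every offset an RE-based search reports should then also appear in this list; one that does not points at a bug in one of the two programs.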

On my Windows 10 machine, I have cygwin.
Microsoft says my CPU doesn't have a TPM and
the specific Intel Core I7 on my system is not
supported so Windows 11 is not happening.
Microsoft is DOS personified.
 (An unkind editorial remark about the low
  quality of software coming from Microsoft.)

Anyway, I thank you again for your patience with me
and your observations. I value your views and the
other views I've seen here on coff at tuhs.org.

I welcome all input to my education and will share
all I have done so far with anyone who wants to
collaborate, test, or is just curious.

    GOAL: run python program from an at-cost thumb drive that:
          reaps all media files from a user specified
          directory/folder tree and

          Adds files to the thumb drive.

          *Adds files* means
            Original file system is untouched

            Adds only unique files (hash codes are unique)

            Creates on the thumb drive a relative directory
              wherein the original file was found

            Prepends a "YYYY-MM-DD-" string to the filename
              if one can be found (EXIF is great shortcut).

            Copies
                      srcroot/relative_path/oldfilename
              to
                   thumbdrive/relative_path/YYYY-MM-DD-oldfilename
                     or
                   thumbdrive/relative_path/0000-oldfilename.

          Can also incrementally add new files by just
            scanning anywhere in any other computer
            file system or any other computer.

          Must work on Mac, Windows, and Linux
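
The "Adds files" steps above could be sketched roughly as follows. This is not the working prototype, only an illustration under stated assumptions: sha256_of, add_files and the find_date callback are hypothetical names, SHA-256 is an assumed choice of hash, and no media-type filtering or filename-collision handling is attempted.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_files(src_root, dest_root, find_date=lambda path: None):
    """Copy files not yet present (by content hash) from src_root to dest_root.

    find_date(path) is assumed to return "YYYY-MM-DD" or None; when it
    returns None the "0000" prefix is used. The source tree is only read.
    """
    src_root, dest_root = Path(src_root), Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    # Re-hash what is already on the destination so reruns add nothing twice.
    seen = {sha256_of(p) for p in dest_root.rglob("*") if p.is_file()}
    copied = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        digest = sha256_of(src)
        if digest in seen:
            continue
        seen.add(digest)
        prefix = find_date(src) or "0000"
        relative_dir = src.relative_to(src_root).parent
        dest = dest_root / relative_dir / f"{prefix}-{src.name}"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also keeps timestamps
        copied.append(dest)
    return copied
```

Calling add_files() again with a different source tree and the same destination gives the incremental behaviour, since the set of known hashes is rebuilt from the destination on each run.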

What I have is a working prototype. It works
on Mac and Windows. It doesn't do the
date thing very well, and there are other shortcomings.

I have delivered exactly one Christmas present to my favorite person
in the world - a 400 GB SSD drive with all our pictures and media
we have ever taken. The next things are to *add* more media
and *re-unique-ify* (check) what is already present on the SSD drive
and *improve the proper choice of "YYYY-MM-DD-" prefix* to
filenames.

I am retired and this is fun.
I'm too old to want to get rich.

Ed Bradford
Pflugerville, TX
egbegb2 at gmail.com





On Tue, Mar 7, 2023 at 5:40 AM Ralph Corderoy <ralph at inputplus.co.uk> wrote:

> Hi Ed,
>
> > I have made an attempt to make my RE stuff readable and supportable.
>
> Readable to you, which is fine because you're the prime future reader.
> But it's less readable than the regexp to those that know and read them
> because of the indirection introduced by the variables.  You've created
> your own little language of CAPITALS rather than the lingua franca of
> regexps.  :-)
>
> > Machine language was unreadable and then along came assembly language.
> > Assembly language was unreadable, then came higher level languages.
>
> Each time the original language was readable because practitioners had
> to read and write it.  When its replacement came along, the old skill
> was no longer learnt and the language became ‘unreadable’.
>
> > So far, I can do that for this RE program that works for small files,
> > large files, binary files and text files for exactly one pattern:
> >     YYYY[-MM-DD]
> > I constructed this RE with code like this:
> >     # ymdt is YYYY-MM-DD RE in text.
> >     # looking only for 1900s and 2000s years and no later than today.
> >     _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"
>
> ‘{1}’ is redundant.
>
> >     # months
> >     _MM   = "(0[1-9]|1[012])"
> >     # days
> >     _DD   = "(0[1-9]|[12]\d|3[01])"
> >     ymdt = _YYYY + '[' + _INTERNALSEP +
> >                          _MM          +
> >                          _INTERNALSEP +
> >                    ']'{0,1)
>
> I think we're missing something as the ‘'['’ is starting a character
> class which is odd for wrapping the month and the ‘{0,1)’ doesn't have
> matching brackets and is outside the string.
>
> BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
>
> > For the whole file, RE I used
> >     ymdthf = _FRSEP + ymdt + _BASEP
> > where FRSEP is front separator which includes
> > a bunch of possible separators, excluding numbers and letters, or-ed
> > with the up arrow "beginning of line" RE mark.
>
> It sounds like you're wanting a word boundary; something provided by
> regexps.  In Python, it's ‘\b’.
>
>     >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
>     (<re.Match object; span=(16, 19), match='foo'>,)
>
> Are you aware of the /x modifier to a regexp which ignores internal
> whitespace, including linefeeds?  This allows a large regexp to be split
> over lines.  There's a comment syntax too.  See
> https://docs.python.org/3/library/re.html#re.X
>
> GNU grep isn't too shabby at looking through binary files.  I can't use
> /x with grep so in a bash script, I'd do it manually.  \< and \> match
> the start and end of a word, a bit like Python's \b.
>
>     re='
>         .?\<
>             (19[0-9][0-9]|20[01][0-9]|202[0-3])
>             (
>                 ([-:._])
>                 (0[1-9]|1[0-2])
>                 \3
>                 (0[1-9]|[12][0-9]|3[01])
>             )?
>         \>.?
>     '
>     re=${re//$'\n'/}
>     re=${re// /}
>
>     printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file
>     LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l
>
> which gives
>
>     0:2001-04-01,$
>     11:1999_12_31$
>     22:1944.03.01,$
>     33:1914!$
>     39:2000-$
>
> showing:
>
> - the byte offset within the file of each match,
> - along with any byte before and after, if it's not a \n and not
>   already matched, just to show the word-boundary at work,
> - with any non-printables escaped into octal by sed.
>
> > I thought I was on the COFF mailing list.
>
> I'm sending this to just the list.
>
> > I received this email by direct mail to from Larry.
>
> Perhaps your account on the list is configured to not send you an email
> if it sees your address in the header's fields.
>
> --
> Cheers, Ralph.
>


-- 
Advice is judged by results, not by intentions.
  Cicero

From crossd at gmail.com  Thu Mar  9 05:52:43 2023
From: crossd at gmail.com (Dan Cross)
Date: Wed, 8 Mar 2023 14:52:43 -0500
Subject: [COFF] [TUHS] the wheel of reincarnation goes sideways
In-Reply-To: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
Message-ID: <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>

[bumping to COFF]

On Wed, Mar 8, 2023 at 2:05 PM ron minnich <rminnich at gmail.com> wrote:
> The wheel of reincarnation discussion got me to thinking:
>
> What I'm seeing is reversing the rotation of the wheel of reincarnation. Instead of pulling the task (e.g. graphics) from a special purpose device back into the general purpose domain, the general purpose computing domain is pushed into the special purpose device.
>
> I first saw this almost 10 years ago with a WLAN modem chip that ran linux on its 4 core cpu, all of it in a tiny package. It was faster, better, and cheaper than its traditional embedded predecessor -- because the software stack was less dedicated and single-company-created. Take Linux, add some stuff, voila! WLAN modem.
>
> Now I'm seeing it in peripheral devices that have, not one, but several independent SoCs, all running Linux, on one card. There's even been a recent remote code exploit on, ... an LCD panel.
>
> Any of these little devices, with the better part of a 1G flash and a large part of 1G DRAM, dwarfs anything Unix ever ran on. And there are more and more of them, all over the little PCB in a laptop.
>
> The evolution of platforms like laptops to becoming full distributed systems continues.
> The wheel of reincarnation spins counter clockwise -- or sideways?

About a year ago, I ran across an email written a decade or more prior
on some mainframe mailing list where someone wrote something like,
"wow! It just occurred to me that my Athlon machine is faster than the
ES/3090-600J I used in 1989!" Some guy responded angrily, rising to
the wounded honor of IBM, raving about how preposterous this was
because the mainframe could handle a thousand users logged in at one
time and there's no way this Linux box could ever do that.

I was struck by the absurdity of that; it's such a ridiculous
non-comparison. The mainframe had layers of terminal concentrators,
3270 controllers, IO controllers, etc, etc, and a software ecosystem
that made heavy use of all of that, all to keep user interaction _off_
of the actual CPU (I guess freeing that up to run COBOL programs in
batch mode...); it's not as though every time a mainframe user typed
something into a form on their terminal it interrupted the primary
CPU.

Of course, the first guy was right: the AMD machine probably _was_
more capable than a 3090 in terms of CPU performance, RAM and storage
capacity, and raw bandwidth between the CPU and IO subsystems. But the
3090 was really more like a distributed system than the Athlon box
was, with all sorts of offload capabilities. For that matter, a
thousand users probably _could_ telnet into the Athlon system. With
telnet in line mode, it'd probably even be decently responsive.

So often it seems to me like end-user systems are just continuing to
adopt "large system" techniques. Nothing new under the sun.

> I'm no longer sure the whole idea of the wheel of reincarnation is even applicable.

I often feel like the wheel has fallen onto its side, and we're
continually picking it up from the edge and flipping it over, ad
nauseam.

        - Dan C.

From coff at tuhs.org  Thu Mar  9 06:18:42 2023
From: coff at tuhs.org (Tom Ivar Helbekkmo via COFF)
Date: Wed, 08 Mar 2023 21:18:42 +0100
Subject: [COFF] the wheel of reincarnation goes sideways
In-Reply-To: <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
 (Dan Cross's message of "Wed, 8 Mar 2023 14:52:43 -0500")
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
 <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
Message-ID: <m2lek7yp25.fsf@thuvia.hamartun.priv.no>

Dan Cross <crossd at gmail.com> writes:

> About a year ago, I ran across an email written a decade or more prior
> on some mainframe mailing list where someone wrote something like,
> "wow! It just occurred to me that my Athlon machine is faster than the
> ES/3090-600J I used in 1989!" Some guy responded angrily, rising to
> the wounded honor of IBM, raving about how preposterous this was
> because the mainframe could handle a thousand users logged in at one
> time and there's no way this Linux box could ever do that.
>
> I was struck by the absurdity of that; it's such a ridiculous
> non-comparison.

I did one of those.  Back in the early nineties, I had a 286 box running
MINIX 1.5 as my home workstation, and a similar one running DOS at work.
My job, however, was as one of a team of sysadmins caring for a VAX-780
running VMS.

I used C-TeX to format documents on the DOS PC, and spent a couple of
days porting it to the VMS C compiler.  Performance was utterly dismal
at first, but once I realized that the stdio stuff in the standard
library was the problem, I modified C-TeX to do output to binary files of
fixed size 512 byte blocks in RMS, the VMS file system.  In the small
hours of the night, I discovered that the big and expensive VAX-780 was
able to pretty much exactly match my 286-box when formatting documents.

The very next day, I found that the same machine did the TeX formatting
just as fast, while a hundred or so other people were actively using it
for their own work.

-tih
-- 
Most people who graduate with CS degrees don't understand the significance
of Lisp.  Lisp is the most important idea in computer science.  --Alan Kay

From cowan at ccil.org  Thu Mar  9 11:22:39 2023
From: cowan at ccil.org (John Cowan)
Date: Wed, 8 Mar 2023 20:22:39 -0500
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To: <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
 <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
Message-ID: <CAD2gp_R=uqX4VidxTik8d1L-KYSsp4KwdMSFXx2kJdrCJ6ojbQ@mail.gmail.com>

On Wed, Mar 8, 2023 at 2:53 PM Dan Cross <crossd at gmail.com> wrote:


> > Now I'm seeing it in peripheral devices that have, not one, but several
> independent SoCs, all running Linux, on one card. There's even been a
> recent remote code exploit on, ... an LCD panel.
>

I remember at one time I had on my desk a PC with an 80x86 CPU and an
Ethernet card that had an 80(x+1)86 chip inside.  I think x=0, but I'm not
sure.


> But the
> 3090 was really more like a distributed system than the Athlon box
> was, with all sorts of offload capabilities. For that matter, a
> thousand users probably _could_ telnet into the Athlon system. With
> telnet in line mode, it'd probably even be decently responsive.
>

I find that difficult to believe.  It seems too high by an order of
magnitude.  Another thing that doesn't get mentioned much is that classic
mainframes had SRAM, so their memory bandwidth was enormous.

From crossd at gmail.com  Fri Mar 10 05:55:44 2023
From: crossd at gmail.com (Dan Cross)
Date: Thu, 9 Mar 2023 14:55:44 -0500
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To: <CAD2gp_R=uqX4VidxTik8d1L-KYSsp4KwdMSFXx2kJdrCJ6ojbQ@mail.gmail.com>
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
 <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
 <CAD2gp_R=uqX4VidxTik8d1L-KYSsp4KwdMSFXx2kJdrCJ6ojbQ@mail.gmail.com>
Message-ID: <CAEoi9W7WWJbj-OTF2as-RK2ZZUeOugZ=MDnbdEJ7v_i7RGAcGg@mail.gmail.com>

On Wed, Mar 8, 2023 at 8:22 PM John Cowan <cowan at ccil.org> wrote:
> On Wed, Mar 8, 2023 at 2:53 PM Dan Cross <crossd at gmail.com> wrote:
>> But the
>> 3090 was really more like a distributed system than the Athlon box
>> was, with all sorts of offload capabilities. For that matter, a
>> thousand users probably _could_ telnet into the Athlon system. With
>> telnet in line mode, it'd probably even be decently responsive.
>
> I find that difficult to believe.  It seems too high by an order of magnitude.

I'm not going to claim it would be zippy, but I do think it would work
acceptably.

Suppose that 1000 users telnet'ed into the x86 machine, but remained
essentially idle; what resources would that consume? We'd have 1000
open TCP connections, a thousand shell processes, a thousand
telnetd's, etc. All of that would consume some amount of RAM (though
there'd be a lot of sharing of text and read-only data and so on),
some VM space requiring RAM for paging structures and so on, some
accounting data in the kernel, 1000 pseudo-ttys allocated, entries in
the process table, etc. But, most of those shells would spend most of
their time blocked waiting on input, so wouldn't consume CPU
continuously, and similarly with the TCP connections mostly idle, the
kernel is not generally wasting a lot of processor time on the login
sessions. There'd be some bookkeeping data on disk, but that would be
small. System overhead would amount to maybe a few megabytes, I'd
imagine.

If all of those users ran telnet in line mode, then the system isn't
getting pounded with interrupts all the time, even if they're
executing commands (the per-character overhead would be absorbed by
the client).

I don't think I have a machine of quite the Athlon vintage, but I _do_
have a machine with a Ryzen processor that's a couple of years old
down in my basement. As an experiment, I wrote a little "expect"
script to log in to that machine a thousand times, doing so
recursively: that is, the script starts off ssh'ing into the machine,
and then in that session, logs in again, and so on, a thousand times,
before finally going interactive. I used encryption, public-key
authentication, and compression, and bounced through a "jump host" for
each session, ensuring that I'm using the network for each login. The
effect here is that typing into the final shell sort of simulates 1000
users typing simultaneously, complete with all the glorious interrupt
and scheduler overhead that implies.

Response time in that connection is not bad; certainly on par with the
3090 I used for a while in the early 90s. If I log in in another
window, it doesn't even register that there are a thousand "users"
logged in, even if I'm running something chatty in the "thousand
users" window.

By contrast, the mainframe required a tremendous amount of offload
support to shield the CPU from all of that bursty user activity. They
made user actions look like block transfers, thus amortizing much of
the overhead of interactivity. With the same load, the mainframe is
storing some state data in memory regarding which users are logged in
or connected or dialed or whatever, but the situation isn't that much
different than mostly-idle telnet connections in line-mode: save that
it's even more favorable to the mainframe in that much of the
interaction is per-screen of data, as opposed to per-line.

The difference in interactivity and offload is why I think the
comparison is poor. If the mainframe handled user sessions the same
way the x86 machine handled telnet logins, I imagine it would be
swamped way worse than the AMD machine (or whatever it was that person
was writing about 10 or 15 years ago). Perhaps a better comparison
would be to a web server that was accepting HTTP requests from 1000
different clients. I'm quite sure that x86 machines of the Athlon era
could cope with that load.

> Another thing that doesn't get mentioned much is that classic mainframes had SRAM, so their memory bandwidth was enormous.

I suspect this makes less of a difference than one would hope when
comparing against a modern machine.

The specific comparison in this case was against an IBM 3090-600J. It
appears to use SRAM for cache ("high speed buffer" in IBM-speak), but
seems to use DRAM for central and expanded storage. In this reference
I found on bitsavers, they make a big deal about their "one million
bit memory chip", but that's DRAM
(http://www.bitsavers.org/pdf/ibm/3090/G580-1005-0_The_IBM_3090_Processor_Family_Jul87.pdf;
see "IBM Advances the Technology" on page 10).

Moreover, that machine supported up to 6 CPUs running at a clock rate
of 69 MHz. That same reference says they could bring cycle times down
to 17.2ns using ECL chips; DDR2 can match that. My Mac Studio blows it
out of the water.
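
As a worked check of the figures above, clock rate and cycle time are reciprocals of each other. The helper names below are illustrative. The arithmetic suggests that 69 MHz and 17.2 ns do not describe the same clock, so the two figures presumably refer to different models in the 3090 family; that reading is an assumption, not something the message states.

```python
def cycle_time_ns(clock_hz):
    """Cycle time in nanoseconds for a clock rate given in hertz."""
    return 1e9 / clock_hz


def clock_mhz(cycle_ns):
    """Clock rate in megahertz for a cycle time given in nanoseconds."""
    return 1e3 / cycle_ns


print(round(cycle_time_ns(69e6), 1))  # prints 14.5 (ns per cycle at 69 MHz)
print(round(clock_mhz(17.2), 1))      # prints 58.1 (MHz for a 17.2 ns cycle)
```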

For systems older than the 3090, I'm not sure that the SRAM difference
matters much at all: those machines had tiny memories compared to even
modern cell phones, and their CPUs and buses were pitifully slow. Even
if they had more RAM bandwidth than machines now (which I do not think
is really true), they couldn't use it. Indeed, I suspect their total
memory sizes were smaller than L3 cache (which is SRAM) on modern
machines.

        - Dan C.

From lm at mcvoy.com  Fri Mar 10 06:09:32 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Thu, 9 Mar 2023 12:09:32 -0800
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To: <CAEoi9W7WWJbj-OTF2as-RK2ZZUeOugZ=MDnbdEJ7v_i7RGAcGg@mail.gmail.com>
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
 <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
 <CAD2gp_R=uqX4VidxTik8d1L-KYSsp4KwdMSFXx2kJdrCJ6ojbQ@mail.gmail.com>
 <CAEoi9W7WWJbj-OTF2as-RK2ZZUeOugZ=MDnbdEJ7v_i7RGAcGg@mail.gmail.com>
Message-ID: <20230309200932.GK9225@mcvoy.com>

On Thu, Mar 09, 2023 at 02:55:44PM -0500, Dan Cross wrote:
> On Wed, Mar 8, 2023 at 8:22 PM John Cowan <cowan at ccil.org> wrote:
> > On Wed, Mar 8, 2023 at 2:53 PM Dan Cross <crossd at gmail.com> wrote:
> >> But the
> >> 3090 was really more like a distributed system than the Athlon box
> >> was, with all sorts of offload capabilities. For that matter, a
> >> thousand users probably _could_ telnet into the Athlon system. With
> >> telnet in line mode, it'd probably even be decently responsive.
> >
> > I find that difficult to believe.  It seems too high by an order of magnitude.
> 
> I'm not going to claim it would be zippy, but I do think it would work
> acceptably.
> 
> Suppose that 1000 users telnet'ed into the x86 machine, but remained
> essentially idle; what resources would that consume? We'd have 1000
> open TCP connections, a thousand shell processes, a thousand
> telnetd's, etc. 

The early Unix code really did not like stuff like this.  Lots of linear
scans through what were assumed to be short lists.  I still remember an
SGI Challenge being brought to its knees by a bunch of racks of modems.
The same machine could move a ton of data but not when it was being
forced through a zillion sockets.

Linux seems well past that problem but it's possible that back in the
Athlon days it still sucked.  I pinged Linus; if he remembers when the
kernel got taught to scale on sockets, I'll report back.

--lm

From stewart at serissa.com  Sat Mar 11 00:20:48 2023
From: stewart at serissa.com (Larry Stewart)
Date: Fri, 10 Mar 2023 09:20:48 -0500
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <20230310131512.891A8212A8@orac.inputplus.co.uk>
References: <20230310131512.891A8212A8@orac.inputplus.co.uk>
Message-ID: <498576F7-6881-4176-B187-F4ACB0A42F76@serissa.com>

TLDR exceptions don't make it better, they make it different.

The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions.

The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on.
At the application level, literally anything can happen on any call.

The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM.  In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise.

This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost.  So no one did this, and at the top level, literally any exception could occur.

Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases.

On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity.

I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments), (2) temporary (out of network sockets or whatever), (3) resolvable by the sysadmin (config), (4) real bug, resolvable by the author.
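
A rough sketch of that sorting in C, assuming errno-style error codes. The enum, both function names, and the particular errno-to-category mapping are hypothetical, chosen only to illustrate the four categories; a real program would pick its own mapping.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical categories, one per item in the list above. */
enum err_class {
    ERR_USER,       /* (1) resolvable by the user (bad arguments) */
    ERR_TEMPORARY,  /* (2) temporary; retrying later may work */
    ERR_SYSADMIN,   /* (3) resolvable by the sysadmin (config) */
    ERR_BUG         /* (4) real bug, resolvable by the author */
};

/* Sort an errno value into a category.  The mapping is illustrative only. */
enum err_class
classify_errno(int err)
{
    switch (err) {
    case EINVAL:
    case ENOENT:
        return ERR_USER;
    case EAGAIN:
    case EINTR:
    case EMFILE:
    case ENOBUFS:
        return ERR_TEMPORARY;
    case EACCES:
    case EPERM:
    case ENOSPC:
        return ERR_SYSADMIN;
    default:
        return ERR_BUG;
    }
}

/* Tell the user what to do next, instead of "unknown error". */
const char *
err_class_advice(enum err_class c)
{
    switch (c) {
    case ERR_USER:      return "check the arguments and try again";
    case ERR_TEMPORARY: return "try again later";
    case ERR_SYSADMIN:  return "ask the administrator to check the configuration";
    case ERR_BUG:       return "please report this as a bug";
    }
    assert(!"unknown err_class");   /* unreachable for valid input */
    return "unknown error";
}
```

A caller that has just seen a failed system call could then print err_class_advice(classify_errno(errno)) next to strerror(errno), so the message says what kind of problem it is as well as what went wrong.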

The usual practice of course is the popup "Received unknown error, OK?"

-Larry

> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy <ralph at inputplus.co.uk> wrote:
> 
> Hi Noel,
> 
>>> if you say above that most people are unfamiliar with them due to
>>> their use of goto then that's probably wrong
>> 
>> I didn't say that.
> 
> Thanks for clarifying; I did know it was a possibility.
> 
>> I was just astonished that in a long thread about handling exceptional
>> conditions, nobody had mentioned . . . exceptions.  Clearly, either
>> unfamiliarity (perhaps because not many languages provide them - as you
>> point out, Go does not), or not top of mind.
> 
> Or perhaps those happy to use gotos also tend to be those who dislike
> exceptions.  :-)
> 
> Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF.
> 
> -- 
> Cheers, Ralph.


From bakul at iitbombay.org  Sat Mar 11 03:11:25 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Fri, 10 Mar 2023 09:11:25 -0800
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
Message-ID: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org>

To make exception handling robust, I think every exception needs to be explicitly handled somewhere. If an exception is not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make it more efficient).

>  On Mar 10, 2023, at 6:21 AM, Larry Stewart <stewart at serissa.com> wrote:
> TLDR exceptions don't make it better, they make it different.

From coff at tuhs.org  Sat Mar 11 03:28:44 2023
From: coff at tuhs.org (segaloco via COFF)
Date: Fri, 10 Mar 2023 17:28:44 +0000
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org>
References: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org>
Message-ID: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>

On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice.

If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C#, but I've never gone looking for it; all I know is that catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code...

- Matt G.
------- Original Message -------
On Friday, March 10th, 2023 at 9:11 AM, Bakul Shah <bakul at iitbombay.org> wrote:

> To make exception handling robust, I think every exception needs to be explicitly handled somewhere. If an exception is not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make it more efficient).

From lm at mcvoy.com  Sat Mar 11 03:34:53 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Fri, 10 Mar 2023 09:34:53 -0800
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
References: <445734B5-EFA3-4224-8ABE-81D35A4D9CEC@iitbombay.org>
 <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
Message-ID: <20230310173453.GA9225@mcvoy.com>

On Fri, Mar 10, 2023 at 05:28:44PM +0000, segaloco via COFF wrote:
> On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice.

#include <assert.h>
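
For the record, a minimal sketch of what that buys you (sum() is a made-up helper, not from any real code base). A failed assert() prints the expression with its file and line to stderr and calls abort(); building with -DNDEBUG compiles the checks out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: add up the first n elements of v. */
long
sum(const int *v, size_t n)
{
    long total = 0;
    size_t i;

    /* Hard-crash on a caller bug instead of limping along. */
    assert(v != NULL || n == 0);

    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

Calling sum(NULL, 3) in a debug build aborts right at the offending call; whether that also leaves a core file depends on the system's settings.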

From imp at bsdimp.com  Sat Mar 11 03:34:57 2023
From: imp at bsdimp.com (Warner Losh)
Date: Fri, 10 Mar 2023 10:34:57 -0700
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <20230310131512.891A8212A8@orac.inputplus.co.uk>
References: <20230310121550.9A80718C080@mercury.lcs.mit.edu>
 <20230310131512.891A8212A8@orac.inputplus.co.uk>
Message-ID: <CANCZdfob3HM34_t2OjGqUbt-p6LEQfbWDQ6-Ttc0O0tL=6dWog@mail.gmail.com>

On Fri, Mar 10, 2023 at 6:15 AM Ralph Corderoy <ralph at inputplus.co.uk>
wrote:

> Hi Noel,
>
> > > if you say above that most people are unfamiliar with them due to
> > > their use of goto then that's probably wrong
> >
> > I didn't say that.
>
> Thanks for clarifying; I did know it was a possibility.
>

Exception handling is a great leap sideways. It's a supercharged goto with
steroids on top. In some ways more constrained, in other ways more prone to
abuse.

Example:

I diagnosed performance problems in a program that would call into
'waiting' threads that would read data from a pipe and then queue work.
Easy, simple, straightforward design. Except they used exceptions to then
process the packets rather than having a proper lockless producer /
consumer queue.

Exceptions are great for keeping the code linear and ignoring error
conditions logically, but still having them handled "somewhere" above the
current code and writing the code such that when it gets an abort, partial
work is cleaned up and trashed.

Global exception handlers are both good and bad. All errors become
tracebacks to where it occurred. People often don't disambiguate between
expected and unexpected exceptions, so programming errors get lumped in
with remote devices committing protocol errors get lumped in with your
config file had a typo and /dve/ttyU2 doesn't exist. It can be hard for the
user to know what comes next when it's all jumbled together. In-line error
handling, at least, can catch the expected things and give a more
reasonable error near to where it happened so I know if my next step is vi
prog.conf or email support at prog.com.

So it's a hate hate relationship with both. What do I hate the least?
That's a three drink minimum for the answer.

Warner

From bakul at iitbombay.org  Sat Mar 11 03:35:44 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Fri, 10 Mar 2023 09:35:44 -0800
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
Message-ID: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>

During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time!

> On Mar 10, 2023, at 9:28 AM, segaloco <segaloco at protonmail.com> wrote:
> 
> 
> On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice.
> 
> If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code...
> 
> - Matt G.

From lm at mcvoy.com  Sat Mar 11 03:42:22 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Fri, 10 Mar 2023 09:42:22 -0800
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
Message-ID: <20230310174222.GB9225@mcvoy.com>

On Fri, Mar 10, 2023 at 09:35:44AM -0800, Bakul Shah wrote:
> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don't want to run a program *under a debugger* but want it invoked at the right time!

Indeed.

void
gdb_backtrace(void)
{
        FILE    *f;
        char    *cmd;

        unless (getenv("_BK_BACKTRACE")) return;
        unless ((f = efopen("BK_TTYPRINTF")) ||
            (f = fopen(DEV_TTY, "w"))) {
                f = stderr;
        }
        cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
            bin, getpid(), fileno(f), fileno(f));

        system(cmd);
        free(cmd);
        if (f != stderr) fclose(f);
}
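
For readers without the sources this came from: unless, efopen, aprintf, bin and DEV_TTY above are project-local helpers that are not shown here (the bk names suggest BitKeeper, and unless is presumably a macro for if-not). A rough equivalent using only standard C and POSIX calls might look like the sketch below. The names backtrace_cmd and gdb_backtrace_sketch and the environment variable MYPROG_BACKTRACE are hypothetical, the caller has to supply the path to its own executable, and output simply goes to stderr rather than to the tty.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Format the gdb command line into buf.  Returns 0 on success, -1 if the
 * command did not fit.  exe is the path to the running binary.
 */
int
backtrace_cmd(char *buf, size_t len, const char *exe, long pid)
{
    int n;

    assert(buf != NULL && exe != NULL);
    n = snprintf(buf, len, "gdb -batch -ex backtrace '%s' %ld 1>&2",
        exe, pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Have gdb attach to this process and print a backtrace to stderr, but
 * only when the (hypothetical) variable MYPROG_BACKTRACE is set.
 */
void
gdb_backtrace_sketch(const char *exe)
{
    char cmd[1024];

    if (getenv("MYPROG_BACKTRACE") == NULL)
        return;
    if (backtrace_cmd(cmd, sizeof(cmd), exe, (long)getpid()) == 0)
        (void)system(cmd);
}
```

The environment-variable guard is the same idea as in the original: the hook costs nothing in normal runs and only fires when someone asks for it.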


From coff at tuhs.org  Sat Mar 11 03:43:28 2023
From: coff at tuhs.org (segaloco via COFF)
Date: Fri, 10 Mar 2023 17:43:28 +0000
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
Message-ID: <VnBzbJV0nwdFrThDKdDdfO66g4Wyi3VmgTQgSw6vQaamv1bbF4VJyFtZMJHyWKmqD9CSri6rpjJEwdBgYvW9P_3zuwMYOWo-56DBSgfGsHY=@protonmail.com>

Yeah, it's a pain, and different in different languages. My horror stories are mainly C# since that's what day job stuff is these days (backend anyway). The way assert does it is great: one little cpp define and it all goes away. However, that being compile time, it only applies to what is yours; if you're stuck with someone else's object code, you get what you get :/

- Matt G.
------- Original Message -------
On Friday, March 10th, 2023 at 9:35 AM, Bakul Shah <bakul at iitbombay.org> wrote:

> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time!

From bakul at iitbombay.org  Sat Mar 11 03:47:29 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Fri, 10 Mar 2023 09:47:29 -0800
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
Message-ID: <1DCF3FAD-ADAA-4FEC-8A76-739DF67A4859@iitbombay.org>

I should add that (compared to goto or setjmp/longjmp), by making exceptions a language thing, the compiler can attach more context to the exception event (or condition). In the scheme I outlined, the vendor library function must declare what exceptions it doesn’t handle and the compiler can pass more context that may not make sense to a library user but may help its developer pinpoint the cause.

> On Mar 10, 2023, at 9:28 AM, segaloco <segaloco at protonmail.com> wrote:
> 
> If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code...

From crossd at gmail.com  Sat Mar 11 04:03:23 2023
From: crossd at gmail.com (Dan Cross)
Date: Fri, 10 Mar 2023 13:03:23 -0500
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
Message-ID: <CAEoi9W4ny7-FRmyO4BtnqM-pk=F4Qm=kc=Z4qJnL2VAe89fAyA@mail.gmail.com>

On Fri, Mar 10, 2023 at 12:36 PM Bakul Shah <bakul at iitbombay.org> wrote:
> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time!

Common Lisp implementations have been doing that for years! Too bad
using Lisp means bringing all the rest of the Lisp stuff with it,
including the attitude. Oh well. :-)

        - Dan C.


> On the flip side something I've always thought would be powerful, at least in development, is a way to tell any and all procedures being called to ignore their exception/condition handling and hard-crash. Of course you don't want to take that kind of hammer to a production situation, but a way to override any and all handling so that real errors become apparent would be incredibly nice.
>
> If nothing else, I could provide much better stack traces to vendors when I'm particularly stuck on something and convinced it isn't my fault. Maybe such a thing exists in C# but I've never gone looking for it, all I know is catching an exception from some vendor library with zero useful information makes me want to take a hammer to much more than the code...
>
> - Matt G.
> ------- Original Message -------
> On Friday, March 10th, 2023 at 9:11 AM, Bakul Shah <bakul at iitbombay.org> wrote:
>
> To make exception handling robust, I think every exception needs to be explicitly handled somewhere. If an exception is not handled by a function, that fact must be specified in the function declaration. In effect the compiler can check that every exception has a handler somewhere. I think you can implement it using different syntactic sugar than Go’s obnoxious error handling but basically the same (though you may be tempted to make it more efficient).
>
>  On Mar 10, 2023, at 6:21 AM, Larry Stewart <stewart at serissa.com> wrote:
>
> TLDR exceptions don't make it better, they make it different.
>
> The Mesa and Cedar languages at PARC CSL were intended to be "Systems Languages" and fully embraced exceptions.
>
> The problem is that it is extremely tempting for the author of a library to use them, and equally tempting for the authors of library calls used by the first library, and so on.
> At the application level, literally anything can happen on any call.
>
> The Cedar OS was a library OS, where applications ran in the same address space, since there was no VM.  In 1982 or so I set out to write a shell for it, and was determined that regardless of what happened, the shell should not crash, so I set out to guard every single call with handlers for every exception they could raise.
>
> This was an immensely frustrating process because while the language suggested that the author of a library capture exceptions on the way by and translate them to one at the package level, this is a terrible idea in its own way, because you can't debug - the state of the ultimate problem was lost.  So no one did this, and at the top level, literally any exception could occur.
>
> Another thing that happens with exceptions is that programmers get the bright idea to use them for conditions which are uncommon, but expected, so any user of the function has to write complicated code to deal with these cases.
>
> On the whole, I came away with a great deal of grudging respect for ERRNO as striking a great balance between ease of use and specificity.
>
> I also evolved Larry's Theory of Exceptions, which is that it is the programmer's job to sort exceptional conditions into actionable categories: (1) resolvable by the user (bad arguments) (2) Temporary (out of network sockets or whatever) (3) resolvable by the sysadmin (config) (4) real bug, resolvable by the author.
>
> The usual practice of course is the popup "Received unknown error, OK?"
>
> -Larry
>
> On Mar 10, 2023, at 8:15 AM, Ralph Corderoy <ralph at inputplus.co.uk> wrote:
>
> Hi Noel,
>
> if you say above that most people are unfamiliar with them due to
> their use of goto then that's probably wrong
>
> I didn't say that.
>
> Thanks for clarifying; I did know it was a possibility.
>
> I was just astonished that in a long thread about handling exceptional
> conditions, nobody had mentioned . . . exceptions.  Clearly, either
> unfamiliarity (perhaps because not many languages provide them - as you
> point out, Go does not), or not top of mind.
>
> Or perhaps those happy to use gotos also tend to be those who dislike
> exceptions.  :-)
>
> Anyway, I'm off-TUHS-pic so follow-ups set to goto COFF.
>
> --
> Cheers, Ralph.
>

From steffen at sdaoden.eu  Sat Mar 11 04:09:38 2023
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Fri, 10 Mar 2023 19:09:38 +0100
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <498576F7-6881-4176-B187-F4ACB0A42F76@serissa.com>
References: <20230310131512.891A8212A8@orac.inputplus.co.uk>
 <498576F7-6881-4176-B187-F4ACB0A42F76@serissa.com>
Message-ID: <20230310180938.6rYu2%steffen@sdaoden.eu>

Larry Stewart wrote in
 <498576F7-6881-4176-B187-F4ACB0A42F76 at serissa.com>:
 |TLDR exceptions don't make it better, they make it different.
 ...
 |On the whole, I came away with a great deal of grudging respect for \
 |ERRNO as striking a great balance between ease of use and specificity.

From my user space point of view i never understood why there is
no dedicated hardware register / (plus) error indicating flag that
callers could cheaply and easily test.  (Maybe there is on some
processor platforms, beside a one such where errno then can be
placed in some per-thread structure stored there.  Still this
requires another dedicated return value.)

I ran away from the exceptions i got used to with JAVA to
-fno-rtti -fno-exceptions when i looked at the object output of
g++ 2.95.?, and saw in the support code they use heap memory for
this etc.

 |I also evolved Larry's Theory of Exceptions, which is that it is the \
 |programmer's job to sort exceptional conditions into actionable categori\
 |es: (1) resolvable by the user (bad arguments) (2) Temporary (out of \
 |network sockets or whatever) (3) resolvable by the sysadmin (config) \
 |(4) real bug, resolvable by the author.
  ...

Really interesting point, like SMTP and other protocols which
classify errors in categories.
Errors are one of my waving-helplessly topics, where you simply
have to let things go and where "perfection" just cannot be
achieved in real-life (or add .. as time passes by).
Often you just do not find the correct answer, with errno the name
sometimes fits, but the decade-old description does not really,
and very fast you end up with overloading (eg come to a second
ENODATA because ESRCH is something different, or reuse EILSEQ for
bogus input even though the function already used to use EILSEQ
for non-convertible output).
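
The four categories quoted above can be sketched as an errno classifier.
This is only an illustration: the enum names, the function name and the
choice of which errno value lands in which bucket are assumptions, and in
practice the right bucket depends on which call failed.

```c
#include <errno.h>

/* The four buckets from the quoted "Theory of Exceptions". */
enum err_class {
    ERR_USER,       /* (1) resolvable by the user, e.g. bad arguments */
    ERR_TEMPORARY,  /* (2) temporary, retrying later may succeed */
    ERR_SYSADMIN,   /* (3) resolvable by the sysadmin, e.g. config */
    ERR_BUG         /* (4) real bug, resolvable by the author */
};

/* Hypothetical mapping; which errno goes where is a policy choice. */
enum err_class classify_errno(int e)
{
    switch (e) {
    case EINVAL:
    case ENOENT:
    case EISDIR:
        return ERR_USER;
    case EAGAIN:
    case EINTR:
    case EMFILE:
    case ENOBUFS:
        return ERR_TEMPORARY;
    case EACCES:
    case EPERM:
    case ENOSPC:
    case EROFS:
        return ERR_SYSADMIN;
    default:
        /* Anything unanticipated is treated as a bug to be reported. */
        return ERR_BUG;
    }
}
```

A caller would switch on the result to decide between printing a usage
hint, retrying, telling the admin, or asking for a bug report.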

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From bakul at iitbombay.org  Sat Mar 11 04:57:13 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Fri, 10 Mar 2023 10:57:13 -0800
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <CAEoi9W4ny7-FRmyO4BtnqM-pk=F4Qm=kc=Z4qJnL2VAe89fAyA@mail.gmail.com>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <CAEoi9W4ny7-FRmyO4BtnqM-pk=F4Qm=kc=Z4qJnL2VAe89fAyA@mail.gmail.com>
Message-ID: <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>

On Mar 10, 2023, at 10:03 AM, Dan Cross <crossd at gmail.com> wrote:
> 
> On Fri, Mar 10, 2023 at 12:36 PM Bakul Shah <bakul at iitbombay.org> wrote:
>> During development the runtime should simply invoke a debugger in this case. This should be perfectly doable but for some reason it is considered acceptable to crash a program! I don’t want to run a program *under a debugger* but want it invoked at the right time!
> 
> Common Lisp implementations have been doing that for years! Too bad
> using Lisp means bringing all the rest of the Lisp stuff with it,
> including the attitude. Oh well. :-)

It can even fix the problem and continue!

Note that such things don't have to be *tied* to Lisp. But that
would require a change in mindset.

From marzhall.o at gmail.com  Sat Mar 11 05:57:40 2023
From: marzhall.o at gmail.com (Marshall Conover)
Date: Fri, 10 Mar 2023 14:57:40 -0500
Subject: [COFF] [TUHS] Re: Conditions,
 AKA exceptions. (Was: I can't drive 55: "GOTO considered harmful"
 55th anniversary)
In-Reply-To: <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <CAEoi9W4ny7-FRmyO4BtnqM-pk=F4Qm=kc=Z4qJnL2VAe89fAyA@mail.gmail.com>
 <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
Message-ID: <CAK0pxsEeY3dXkcvw8Zb=3brkGpwYNuYrWLMi1YwpuM2SEH9+Ew@mail.gmail.com>

While all this error and exception discussion is going down, I have to
mention this piece: http://joeduffyblog.com/2016/02/07/the-error-model/

The author worked at MS on their "midori" research OS, and discussed what
went into their decisions around using return codes, exceptions, etc. I
felt it was a nice breakdown of the pros and cons of the different
approaches, and fleshed out the concepts in my mind a bit. I thought others
might enjoy it as well.

That said, I absolutely loathe exceptions with all my heart. In my
experience, along Warner and Matt's lines, they're more prone to the sort
of abuse that wastes my time than they are productive. It's not that they
can't be used well, they just so often aren't.

Cheers,

Marshall

On Fri, Mar 10, 2023 at 1:57 PM Bakul Shah <bakul at iitbombay.org> wrote:

> On Mar 10, 2023, at 10:03 AM, Dan Cross <crossd at gmail.com> wrote:
> >
> > On Fri, Mar 10, 2023 at 12:36 PM Bakul Shah <bakul at iitbombay.org> wrote:
> >> During development the runtime should simply invoke a debugger in this
> case. This should be perfectly doable but for some reason it is considered
> acceptable to crash a program! I don’t want to run a program *under a
> debugger* but want it invoked at the right time!
> >
> > Common Lisp implementations have been doing that for years! Too bad
> > using Lisp means bringing all the rest of the Lisp stuff with it,
> > including the attitude. Oh well. :-)
>
> It can even fix the problem and continue!
>
> Note that such things don't have to be *tied* to Lisp. But that
> would require a change in mindset.

From ralph at inputplus.co.uk  Sat Mar 11 21:25:08 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 11 Mar 2023 11:25:08 +0000
Subject: [COFF] continue N. (Was: I can't drive 55...)
In-Reply-To: <20230310165552.czZmL%steffen@sdaoden.eu>
References: <20230309230130.q4I-f%steffen@sdaoden.eu>
 <RHglcvrCs_xHL1NWcH8pkN1l-IQKszfwYFc8sXRTQRRjGa-7kVJWj-OMzOmWkR89xFwEIiWfoLoAAEHkKF8etYwLR3aGRj9VuwO_MhYbz3s=@protonmail.com>
 <CANCZdfpj_uBkH=hih2Kv+YwTQ-eHOzqgDUJ_+P8a4chOHu=cLQ@mail.gmail.com>
 <alpine.BSF.2.21.9999.2303101657460.4881@aneurin.horsfall.org>
 <20230310165552.czZmL%steffen@sdaoden.eu>
Message-ID: <20230311112508.7306220145@orac.inputplus.co.uk>

Hi Steffen,

COFF'd.

> Very often i find myself needing a restart necessity, so "continue
> N" would that be.  Then again when "N" is a number instead of
> a label this is a (let alone maintenance) mess but for shortest
> code paths.

Do you mean ‘continue’ which re-tests the condition or more like Perl's
‘redo’ which re-starts the loop's body?

   ‘The "redo" command restarts the loop block without evaluating the
    conditional again.  The "continue" block, if any, is not executed.’
        — perldoc -f redo

So like a ‘goto redo’ in

        while (...) {
    redo:
            ...
            if (...)
                goto redo;
            ...
        }
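
A compilable version of that shape, as a minimal sketch.  The function
name and the halving rule are made up for illustration; the point is only
that the jump restarts the body without re-testing the loop condition or
advancing the index.

```c
#include <stddef.h>

/* Hypothetical example: halve each value until it is <= limit, then add
 * it to the running sum.  The array is modified in place. */
long clamp_and_sum(int *v, size_t n, int limit)
{
    long sum = 0;
    size_t i = 0;

    if (limit < 0)
        limit = 0;              /* guarantees the halving terminates */
    while (i < n) {
    redo:
        if (v[i] > limit) {
            v[i] /= 2;
            goto redo;          /* restart the body; "i < n" is not re-tested */
        }
        sum += v[i];
        i++;
    }
    return sum;
}
```

With the values 10, 3, 40 and a limit of 5, the elements settle at 5, 3
and 5.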

-- 
Cheers, Ralph.

From ralph at inputplus.co.uk  Sat Mar 11 21:28:49 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sat, 11 Mar 2023 11:28:49 +0000
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230310174222.GB9225@mcvoy.com>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <20230310174222.GB9225@mcvoy.com>
Message-ID: <20230311112849.22C0920145@orac.inputplus.co.uk>

Hi Larry,

>         cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
>             bin, getpid(), fileno(f), fileno(f));
>
>         system(cmd);

I also came up with this, probably on an SGI Iris Indigo, and got it
added to the Unix Programming FAQ.  :-)

    6.5 How can I generate a stack dump from within a running program?
    http://www.faqs.org/faqs/unix-faq/programmer/faq/

It works surprisingly often, i.e. the process is healthy enough to run
system(3).
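
For anyone wanting to try it without the aprintf() helper used in the
quoted snippet (it is not a standard C function), here is a sketch that
builds the same command with snprintf(3).  The function name and the
fixed-buffer interface are assumptions, and the path is not escaped, so a
single quote in it would break the shell command.

```c
#include <stdio.h>

/* Build the command line for a one-shot gdb backtrace of process `pid`.
 * `exe` is the path of the running binary; gdb's output is redirected to
 * descriptor `fd`.  Returns 0 on success, -1 if the buffer is too small. */
int gdb_backtrace_cmd(char *buf, size_t len, const char *exe,
                      unsigned pid, int fd)
{
    int n = snprintf(buf, len,
                     "gdb -batch -ex backtrace '%s' %u 1>&%d 2>&%d",
                     exe, pid, fd, fd);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The caller would pass (unsigned)getpid() and then hand the buffer to
system(3), as in the quoted code.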

-- 
Cheers, Ralph.

From paul.winalski at gmail.com  Sun Mar 12 01:42:51 2023
From: paul.winalski at gmail.com (Paul Winalski)
Date: Sat, 11 Mar 2023 10:42:51 -0500
Subject: [COFF] continue N. (Was: I can't drive 55...)
In-Reply-To: <20230311112508.7306220145@orac.inputplus.co.uk>
References: <20230309230130.q4I-f%steffen@sdaoden.eu>
 <RHglcvrCs_xHL1NWcH8pkN1l-IQKszfwYFc8sXRTQRRjGa-7kVJWj-OMzOmWkR89xFwEIiWfoLoAAEHkKF8etYwLR3aGRj9VuwO_MhYbz3s=@protonmail.com>
 <CANCZdfpj_uBkH=hih2Kv+YwTQ-eHOzqgDUJ_+P8a4chOHu=cLQ@mail.gmail.com>
 <alpine.BSF.2.21.9999.2303101657460.4881@aneurin.horsfall.org>
 <20230310165552.czZmL%steffen@sdaoden.eu>
 <20230311112508.7306220145@orac.inputplus.co.uk>
Message-ID: <CABH=_VQiYKJgLBKrYRm0zkDQyUytbOPhhqejfAkvtPUMF_p6zw@mail.gmail.com>

Regarding the general subject of using GOTOs:

The first computer on which I did hands-on programming was an IBM
S/360 model 25.  It had 32K of memory available for user
programs--that's both instructions and data.  It executed code at
about a 30 KIPS (yes--KILO instructions/second) rate.  When you're
programming on a machine that is that slow and with that limited an
address space, every instruction counts.  You couldn't afford either
the space or the time to execute conditional tests just to avoid a
GOTO.

Programming using GOTOs doesn't necessarily mean you're writing rat's
nest or spaghetti code.  Yes, you can make a mess using GOTOs, and
perhaps messy code is easier when GOTOs are allowed, but structured
programming just for its own sake can lead to convoluted and messy
program structure as well.  What was rat's nest control flow with
GOTOs can turn into rat's nest data flow of state variables.

It's also worth noting that one of the main functions of a modern
optimizing compiler is to take your nice, structured program and put
all those rat's nest GOTOs (unconditional branch instructions) back so
the thing will execute more quickly.

-Paul W.

From steffen at sdaoden.eu  Sun Mar 12 03:51:02 2023
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Sat, 11 Mar 2023 18:51:02 +0100
Subject: [COFF] continue N. (Was: I can't drive 55...)
In-Reply-To: <20230311112508.7306220145@orac.inputplus.co.uk>
References: <20230309230130.q4I-f%steffen@sdaoden.eu>
 <RHglcvrCs_xHL1NWcH8pkN1l-IQKszfwYFc8sXRTQRRjGa-7kVJWj-OMzOmWkR89xFwEIiWfoLoAAEHkKF8etYwLR3aGRj9VuwO_MhYbz3s=@protonmail.com>
 <CANCZdfpj_uBkH=hih2Kv+YwTQ-eHOzqgDUJ_+P8a4chOHu=cLQ@mail.gmail.com>
 <alpine.BSF.2.21.9999.2303101657460.4881@aneurin.horsfall.org>
 <20230310165552.czZmL%steffen@sdaoden.eu>
 <20230311112508.7306220145@orac.inputplus.co.uk>
Message-ID: <20230311175102.Yl3ha%steffen@sdaoden.eu>

Ralph Corderoy wrote in
 <20230311112508.7306220145 at orac.inputplus.co.uk>:
 |Hi Steffen,
 |
 |COFF'd.
 |
 |> Very often i find myself needing a restart necessity, so "continue
 |> N" would that be.  Then again when "N" is a number instead of
 |> a label this is a (let alone maintenance) mess but for shortest
 |> code paths.
 |
 |Do you mean ‘continue’ which re-tests the condition or more like Perl's
 |‘redo’ which re-starts the loop's body?

No Ralph, i unspecifically meant multiple nested loops where some
inner has to restart/continue the outer (at some point).
So a bit like that of "man perlsyn", but with deeper nesting

       If you need both "next" and "last", you have to do both and also use a
       loop label:

           LOOP: {
               do {{
                   next if $x == $y;
                   last LOOP if $x == $y**2;
                   # do something here
               }} until $x++ > $z;
           }
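
C has no labelled "next", so the usual substitute is a goto to a label at
the bottom of the outer loop body.  A minimal sketch, with a made-up
function that counts the rows of a row-major matrix containing no
negative entry:

```c
#include <stddef.h>

/* Hypothetical example of "continue the outer loop from the inner one". */
size_t count_clean_rows(const int *m, size_t rows, size_t cols)
{
    size_t clean = 0;

    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            if (m[r * cols + c] < 0)
                goto next_row;  /* acts like a "continue 2" */
        }
        clean++;
    next_row:
        ;                       /* a label must be followed by a statement */
    }
    return clean;
}
```

A named label keeps working when another loop level is added later, which
a numeric "continue N" would not.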

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From crossd at gmail.com  Sun Mar 12 06:32:12 2023
From: crossd at gmail.com (Dan Cross)
Date: Sat, 11 Mar 2023 15:32:12 -0500
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To: <20230309200932.GK9225@mcvoy.com>
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
 <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
 <CAD2gp_R=uqX4VidxTik8d1L-KYSsp4KwdMSFXx2kJdrCJ6ojbQ@mail.gmail.com>
 <CAEoi9W7WWJbj-OTF2as-RK2ZZUeOugZ=MDnbdEJ7v_i7RGAcGg@mail.gmail.com>
 <20230309200932.GK9225@mcvoy.com>
Message-ID: <CAEoi9W4E9xvJDZEaYke2zf88tHvGQD9Gafk4FvExNPJLb=rhsQ@mail.gmail.com>

On Thu, Mar 9, 2023 at 3:09 PM Larry McVoy <lm at mcvoy.com> wrote:
> On Thu, Mar 09, 2023 at 02:55:44PM -0500, Dan Cross wrote:
> > On Wed, Mar 8, 2023 at 8:22 PM John Cowan <cowan at ccil.org> wrote:
> > > On Wed, Mar 8, 2023 at 2:53 PM Dan Cross <crossd at gmail.com> wrote:
> > >> But the
> > >> 3090 was really more like a distributed system than the Athlon box
> > >> was, with all sorts of offload capabilities. For that matter, a
> > >> thousand users probably _could_ telnet into the Athlon system. With
> > >> telnet in line mode, it'd probably even be decently responsive.
> > >
> > > I find that difficult to believe.  It seems too high by an order of magnitude.
> >
> > I'm not going to claim it would be zippy, but I do think it would work
> > acceptably.
> >
> > Suppose that 1000 users telnet'ed into the x86 machine, but remained
> > essentially idle; what resources would that consume? We'd have 1000
> > open TCP connections, a thousand shell processes, a thousand
> > telnetd's, etc.
>
> The early Unix code really did not like stuff like this.  Lots of linear
> scans through what were assumed to be short lists.  I still remember an
> SGI Challenge being brought to its knees by a bunch of racks of modems.
> The same machine could move a ton of data but not when it was being
> forced through a zillion sockets.

Oh for sure I wouldn't try it on a VAX or PDP-11. I'm a bit surprised
by the SGI thing, to be honest, but only a bit: as you say, I think
that was just before the big push to make Unix really scalable.

> Linux seems well past that problem but it's possible that back in the
> Athlon days it still sucked.  I pinged Linus, if he remembers when the
> kernel got taught to scale on sockets I'll report back.

Thanks, I'm curious what he says.

        - Dan C.

From bakul at iitbombay.org  Sun Mar 12 09:28:08 2023
From: bakul at iitbombay.org (Bakul Shah)
Date: Sat, 11 Mar 2023 15:28:08 -0800
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To: <CAEoi9W7WWJbj-OTF2as-RK2ZZUeOugZ=MDnbdEJ7v_i7RGAcGg@mail.gmail.com>
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
 <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
 <CAD2gp_R=uqX4VidxTik8d1L-KYSsp4KwdMSFXx2kJdrCJ6ojbQ@mail.gmail.com>
 <CAEoi9W7WWJbj-OTF2as-RK2ZZUeOugZ=MDnbdEJ7v_i7RGAcGg@mail.gmail.com>
Message-ID: <8BD706F3-4F50-4836-91ED-10179F06C177@iitbombay.org>

On Mar 9, 2023, at 11:55 AM, Dan Cross <crossd at gmail.com> wrote:
> 
> Suppose that 1000 users telnet'ed into the x86 machine, but remained
> essentially idle; what resources would that consume? We'd have 1000
> open TCP connections, a thousand shell processes, a thousand
> telnetd's, etc. All of that would consume some amount of RAM (though
> there'd be a lot of sharing of text and read-only data and so on),
> some VM space requiring RAM for paging structures and so on, some
> accounting data in the kernel, 1000 pseudo-ttys allocated, entries in
> the process table, etc. But, most of those shells would spend most of
> their time blocked waiting on input, so wouldn't consume CPU
> continuously, and similarly with the TCP connections mostly idle, the
> kernel is not generally wasting a lot of processor time on the login
> sessions. There'd be some bookkeeping data on disk, but that would be
> small. System overhead would amount to maybe a few megabytes, I'd
> imagine.

Not the same but in 1995 at Real Networks our server s/w running on
a 50MHz or 100MHz Pentium could handle 1000 TCP control connections
(mostly idle) and 1000 UDP "streams", each sending 10 packets/second,
which was the limiting factor. IIRC we had reduced per socket tcp
send/recv buffer size to a small number. I don't recall now whether
these machines had more than 16MB but we didn't want to tie up lots
of memory in idle buffers.

We got a real boost in traffic in Oct'95 when people all over the
world wanted to know the verdict in O.J.Simpson's murder trial in
real time! After that I added code for feeding live streams to any
downstream servers so that theoretically a 3 level distribution
tree can deliver live data to a billion people.
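
The arithmetic behind the billion, as a small sketch.  It assumes every
node in the tree feeds about 1000 downstream connections (the per-server
figure mentioned above) and that each level multiplies the reach by that
fan-out; the function name is made up.

```c
#include <limits.h>

/* Clients reachable when every node feeds `fanout` downstream nodes and
 * the tree is `levels` deep.  Saturates at ULLONG_MAX on overflow. */
unsigned long long tree_reach(unsigned long long fanout, unsigned levels)
{
    unsigned long long reach = 1;

    for (unsigned i = 0; i < levels; i++) {
        if (fanout != 0 && reach > ULLONG_MAX / fanout)
            return ULLONG_MAX;
        reach *= fanout;
    }
    return reach;
}
```

tree_reach(1000, 3) is 1000 * 1000 * 1000 = 10^9.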

From tytso at mit.edu  Sun Mar 12 14:23:48 2023
From: tytso at mit.edu (Theodore Ts'o)
Date: Sat, 11 Mar 2023 23:23:48 -0500
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230311112849.22C0920145@orac.inputplus.co.uk>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <20230310174222.GB9225@mcvoy.com>
 <20230311112849.22C0920145@orac.inputplus.co.uk>
Message-ID: <20230312042348.GJ860405@mit.edu>

On Sat, Mar 11, 2023 at 11:28:49AM +0000, Ralph Corderoy wrote:
> Hi Larry,
> 
> >         cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
> >             bin, getpid(), fileno(f), fileno(f));
> >
> >         system(cmd);
> 
> I also came up with this, probably on an SGI Iris Indigo, and got it
> added to the Unix Programming FAQ.  :-)
> 
>     6.5 How can I generate a stack dump from within a running program?
>     http://www.faqs.org/faqs/unix-faq/programmer/faq/
> 
> It works surprisingly often, i.e. the process is healthy enough to run
> system(3).

On Linux (or some other system using glibc) a limited facility is
built into the C library.  So you can just do something like this:

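       #include <execinfo.h>	/* declares backtrace() and friends */

       {
	       void *stack_syms[32];	/* return addresses of the active frames */
	       int frames;

	       frames = backtrace(stack_syms, 32);
	       /* one line per frame, written straight to fd 2 (stderr) */
	       backtrace_symbols_fd(stack_syms, frames, 2);
       }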

This is convenient if you want a stack trace, but the binary might be
on a rescue floppy which doesn't have space for gdb, or the user might
not have gdb installed.  I use this for the fsck for ext4, and the
nice thing is that it works even with a stripped binary.

For example:

Signal (7) SIGBUS (sent from pid 4261) si_code=SI_USER 
e2fsck(+0x36691)[0x564da1ed2691]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bf90)[0x7f6e21c0bf90]
/lib/x86_64-linux-gnu/libc.so.6(read+0xd)[0x7f6e21cc80ed]
e2fsck(ask_yn+0x1de)[0x564da1ec90de]
e2fsck(fix_problem+0xfc0)[0x564da1ecc7b0]
e2fsck(+0x235b3)[0x564da1ebf5b3]
e2fsck(+0x252d3)[0x564da1ec12d3]
/lib/x86_64-linux-gnu/libext2fs.so.2(ext2fs_dblist_iterate3+0x5f)[0x7f6e21e430cf]
e2fsck(e2fsck_pass2+0x18b)[0x564da1ebdd7b]
e2fsck(e2fsck_run+0x5a)[0x564da1eb0c3a]
e2fsck(main+0x16cb)[0x564da1eacdbb]
/lib/x86_64-linux-gnu/libc.so.6(+0x2718a)[0x7f6e21bf718a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f6e21bf7245]
e2fsck(_start+0x21)[0x564da1eaefc1]

For more information see:

https://github.com/tytso/e2fsprogs/blob/master/e2fsck/sigcatcher.c#L379
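
A self-contained sketch of wiring those two calls into a signal handler.
This is not the e2fsck code; the function names are made up.  One caveat:
backtrace() is not formally async-signal-safe, so calling it from a
handler is a best-effort diagnostic rather than a guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 32

/* Write the current call stack to fd; returns the number of frames. */
int dump_backtrace(int fd)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);

    backtrace_symbols_fd(frames, n, fd);
    return n;
}

/* Hypothetical handler: report on stderr, then give up. */
static void crash_handler(int signo)
{
    (void)signo;
    dump_backtrace(STDERR_FILENO);
    _exit(1);
}

/* Install crash_handler for signo; returns 0 on success, -1 on error. */
int install_crash_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

Calling install_crash_handler(SIGSEGV) and install_crash_handler(SIGBUS)
early in main() is the typical use.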

							- Ted

From ralph at inputplus.co.uk  Sun Mar 12 20:44:17 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Sun, 12 Mar 2023 10:44:17 +0000
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230312042348.GJ860405@mit.edu>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <20230310174222.GB9225@mcvoy.com>
 <20230311112849.22C0920145@orac.inputplus.co.uk>
 <20230312042348.GJ860405@mit.edu>
Message-ID: <20230312104417.DC1DF215AA@orac.inputplus.co.uk>

Hi Ted,

> > >         cmd = aprintf("gdb -batch -ex backtrace '%s/bk' %u 1>&%d 2>&%d",
> > >             bin, getpid(), fileno(f), fileno(f));
...
> > It works surprisingly often, i.e. the process is healthy enough to
> > run system(3).
>
> On Linux (or some other system using glibc) a limited facility is
> built into the C library.  So you can just do something like this:
...
> 	       frames = backtrace(stack_syms, 32);
> 	       backtrace_symbols_fd(stack_syms, frames, 2);

Since ’99, yes.  :-)  backtrace(3) says glibc 2.1 added it.

> I use this for the fsck for ext4, and the nice thing is that it works
> even with a stripped binary.
>
> For example:

Yes, that is nice.

> https://github.com/tytso/e2fsprogs/blob/master/e2fsck/sigcatcher.c#L379

Thanks, I've made a note.

Do you ever find things are so messed up that stdio has trouble whereas
using write(2) with compile-time memory allocations for a buffer would
have a better chance of reaching the TTY?
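
To make the question concrete, this is the kind of thing I mean, as a
rough sketch.  The buffer size and the function name are arbitrary, and
the static buffer makes it non-reentrant.

```c
#include <errno.h>
#include <unistd.h>

static char emergency_buf[256];    /* reserved at compile time, no malloc */

/* Copy msg plus a newline into the static buffer and push it out with
 * write(2), retrying on EINTR and on short writes.  Returns the number
 * of bytes written, or -1 on error.  Over-long messages are truncated. */
long emergency_puts(int fd, const char *msg)
{
    size_t len = 0, done = 0;

    while (msg[len] != '\0' && len < sizeof emergency_buf - 1) {
        emergency_buf[len] = msg[len];
        len++;
    }
    emergency_buf[len++] = '\n';

    while (done < len) {
        ssize_t n = write(fd, emergency_buf + done, len - done);

        if (n <= 0) {
            if (n < 0 && errno == EINTR)
                continue;       /* interrupted before writing: try again */
            return -1;
        }
        done += (size_t)n;
    }
    return (long)done;
}
```

Nothing here touches stdio's buffers or the heap, so it still has a
chance when those are the things that got trampled.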

-- 
Cheers, Ralph.

From paul.winalski at gmail.com  Mon Mar 13 02:46:40 2023
From: paul.winalski at gmail.com (Paul Winalski)
Date: Sun, 12 Mar 2023 12:46:40 -0400
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230312104417.DC1DF215AA@orac.inputplus.co.uk>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <20230310174222.GB9225@mcvoy.com>
 <20230311112849.22C0920145@orac.inputplus.co.uk>
 <20230312042348.GJ860405@mit.edu>
 <20230312104417.DC1DF215AA@orac.inputplus.co.uk>
Message-ID: <CABH=_VRGbqJo3+ck3p=hqNjumZGuuSFpAQDm=+AoCgeb=CeJqg@mail.gmail.com>

On 3/12/23, Ralph Corderoy <ralph at inputplus.co.uk> wrote:
>
> Do you ever find things are so messed up that stdio has trouble whereas
> using write(2) with compile-time memory allocations for a buffer would
> have a better chance of reaching the TTY?

I hate it when that happens.  Even worse is when adding the write(2)
with compile-time memory allocations makes the bug go away.  I once
had to spend three days camped out in someone's office debugging a
compiler crash.  The crash only happened 4 hours into a massive
multi-file compilation, and this guy's login session was the only one
where the problem reproduced under the debugger.  Heisenbugs are hell.

-Paul W.

From lm at mcvoy.com  Mon Mar 13 02:53:19 2023
From: lm at mcvoy.com (Larry McVoy)
Date: Sun, 12 Mar 2023 09:53:19 -0700
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <CABH=_VRGbqJo3+ck3p=hqNjumZGuuSFpAQDm=+AoCgeb=CeJqg@mail.gmail.com>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <20230310174222.GB9225@mcvoy.com>
 <20230311112849.22C0920145@orac.inputplus.co.uk>
 <20230312042348.GJ860405@mit.edu>
 <20230312104417.DC1DF215AA@orac.inputplus.co.uk>
 <CABH=_VRGbqJo3+ck3p=hqNjumZGuuSFpAQDm=+AoCgeb=CeJqg@mail.gmail.com>
Message-ID: <20230312165319.GN9225@mcvoy.com>

On Sun, Mar 12, 2023 at 12:46:40PM -0400, Paul Winalski wrote:
> On 3/12/23, Ralph Corderoy <ralph at inputplus.co.uk> wrote:
> >
> > Do you ever find things are so messed up that stdio has trouble whereas
> > using write(2) with compile-time memory allocations for a buffer would
> > have a better chance of reaching the TTY?
> 
> I hate it when that happens.  Even worse is when adding the write(2)
> with compile-time memory allocations makes the bug go away.  I once
> had to spend three days camped out in someone's office debugging a
> compiler crash.  The crash only happened 4 hours into a massive
> multi-file compilation, and this guy's login session was the only one
> where the problem reproduced under the debugger.  Heisenbugs are hell.

I had one like that.  Sometimes, rarely, suninstall would throw a

	panic(psig)

which meant that someone in the kernel had messed with the process'
signal mask, which is a no-no.

Turns out that the SCSI twins had heard that people were interrupting
suninstall if it took too long, so under certain conditions, the SCSI
tape driver would disable SIGINT.

It was (obviously) my fault because I was doing POSIX conformance and
I was the last person in many kernel files.

Took me a long time to track that one down.
-- 
---
Larry McVoy           Retired to fishing          http://www.mcvoy.com/lm/boat

From ralph at inputplus.co.uk  Tue Mar 14 02:47:18 2023
From: ralph at inputplus.co.uk (Ralph Corderoy)
Date: Mon, 13 Mar 2023 16:47:18 +0000
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <CAK0pxsEeY3dXkcvw8Zb=3brkGpwYNuYrWLMi1YwpuM2SEH9+Ew@mail.gmail.com>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <CAEoi9W4ny7-FRmyO4BtnqM-pk=F4Qm=kc=Z4qJnL2VAe89fAyA@mail.gmail.com>
 <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
 <CAK0pxsEeY3dXkcvw8Zb=3brkGpwYNuYrWLMi1YwpuM2SEH9+Ew@mail.gmail.com>
Message-ID: <20230313164718.4169E21F37@orac.inputplus.co.uk>

Hi Marshall,

> While all this error and exception discussion is going down, I have to
> mention this piece:
> http://joeduffyblog.com/2016/02/07/the-error-model/
>
> The author worked at MS on their "midori" research OS, and discussed
> what went into their decisions around using return codes, exceptions,
> etc. I felt it was a nice breakdown of the pros and cons of the
> different approaches, and fleshed out the concepts in my mind a bit.
> I thought others might enjoy it as well.

Thanks, it was a long read but enjoyable.

> That said, I absolutely loathe exceptions with all my heart.

I'm not a fan either.  The exceptions Joe introduces above are more of a
simpler syntax for handling return codes.  He gives the expanded
equivalent at one point.

I also liked his enthusiasm for ‘abandonment’, similar to a BUG() macro.

-- 
Cheers, Ralph.

From paul.winalski at gmail.com  Tue Mar 14 03:10:17 2023
From: paul.winalski at gmail.com (Paul Winalski)
Date: Mon, 13 Mar 2023 13:10:17 -0400
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <20230313164718.4169E21F37@orac.inputplus.co.uk>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <CAEoi9W4ny7-FRmyO4BtnqM-pk=F4Qm=kc=Z4qJnL2VAe89fAyA@mail.gmail.com>
 <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
 <CAK0pxsEeY3dXkcvw8Zb=3brkGpwYNuYrWLMi1YwpuM2SEH9+Ew@mail.gmail.com>
 <20230313164718.4169E21F37@orac.inputplus.co.uk>
Message-ID: <CABH=_VSLp52gB1+kszMBe3BgxZ+PM2kJwrQgp-z2fBwzAVNosA@mail.gmail.com>

On 3/13/23, Ralph Corderoy <ralph at inputplus.co.uk> wrote:
>
>> That said, I absolutely loathe exceptions with all my heart.
>
> I'm not a fan either.

Exceptions play merry hell with compiler optimizations.  If you are in
a piece of code where an exception can occur, unless you have
knowledge of the global side-effects of the handler(s) that might get
invoked you must abandon any attempts to do data flow analysis of
global data items.

The C++ Standard Library is fond of using throw and catch exception
handling.  An optimizing compiler pretty much has to throw out all data
flow optimization involving global variables, or things passed to a
callee by pointer, if anything in the call chain calls a C++ Standard
Library routine.

>From a compiler writer's perspective, the name STD for the C++
Standard Library is most apt.  STD routines are a disease that infects
anything that touches them.

-Paul W.

From dave at horsfall.org  Tue Mar 14 07:12:53 2023
From: dave at horsfall.org (Dave Horsfall)
Date: Tue, 14 Mar 2023 08:12:53 +1100 (EST)
Subject: [COFF] Conditions, AKA exceptions.
In-Reply-To: <CABH=_VSLp52gB1+kszMBe3BgxZ+PM2kJwrQgp-z2fBwzAVNosA@mail.gmail.com>
References: <zQ8vnI_R8uVwIaiBNo8FIH-OWMoES4VHfnxnnCNISkbFkJcpenXi4H75Z7zFStfclQAue8mixqjL6rWIMBAYRi3bbeGdv0SQUHA9NbQAb44=@protonmail.com>
 <69248852-1701-4938-8A4D-3B27F3018E83@iitbombay.org>
 <CAEoi9W4ny7-FRmyO4BtnqM-pk=F4Qm=kc=Z4qJnL2VAe89fAyA@mail.gmail.com>
 <2CBA9AD7-BC25-40BF-ADE6-A6494D95A4B6@iitbombay.org>
 <CAK0pxsEeY3dXkcvw8Zb=3brkGpwYNuYrWLMi1YwpuM2SEH9+Ew@mail.gmail.com>
 <20230313164718.4169E21F37@orac.inputplus.co.uk>
 <CABH=_VSLp52gB1+kszMBe3BgxZ+PM2kJwrQgp-z2fBwzAVNosA@mail.gmail.com>
Message-ID: <alpine.BSF.2.21.9999.2303140810530.67613@aneurin.horsfall.org>

On Mon, 13 Mar 2023, Paul Winalski wrote:

> From a compiler writer's perspective, the name STD for the C++ Standard 
> Library is most apt.  STD routines are a disease that infects anything 
> that touches them.

.sig!  .sig!

-- Dave

From crossd at gmail.com  Tue Mar 14 08:34:38 2023
From: crossd at gmail.com (Dan Cross)
Date: Mon, 13 Mar 2023 18:34:38 -0400
Subject: [COFF] [TUHS] Re: the wheel of reincarnation goes sideways
In-Reply-To: <ZA+gxAePDMWK6StD@straylight.ringlet.net>
References: <CAP6exY+05fStBtpZGd2HeeNf21fNXeKUTwBV0h5-1YczwF+tew@mail.gmail.com>
 <CAEoi9W4nMjhXvv0zeQViWwuR=PYN7eJCto_Y8W_VRqzfM0-p7w@mail.gmail.com>
 <ZA+gxAePDMWK6StD@straylight.ringlet.net>
Message-ID: <CAEoi9W69JfX4HBknZ8xEZ4chSiqzwco+681SZ93brodzAiuuxg@mail.gmail.com>

I don't know if a thousand users ever logged in there at one time, but
they do tend to have a lot of simultaneous logins.

On Mon, Mar 13, 2023 at 6:16 PM Peter Pentchev <roam at ringlet.net> wrote:
>
> On Wed, Mar 08, 2023 at 02:52:43PM -0500, Dan Cross wrote:
> > [bumping to COFF]
> >
> > On Wed, Mar 8, 2023 at 2:05 PM ron minnich <rminnich at gmail.com> wrote:
> > > The wheel of reincarnation discussion got me to thinking:
> [snip]
> > > The evolution of platforms like laptops to becoming full distributed systems continues.
> > > The wheel of reincarnation spins counter clockwise -- or sideways?
> >
> > About a year ago, I ran across an email written a decade or more prior
> > on some mainframe mailing list where someone wrote something like,
> > "wow! It just occurred to me that my Athlon machine is faster than the
> > ES/3090-600J I used in 1989!" Some guy responded angrily, rising to
> > the wounded honor of IBM, raving about how preposterous this was
> > because the mainframe could handle a thousand users logged in at one
> > time and there's no way this Linux box could ever do that.
> [snip]
> > For that matter, a
> > thousand users probably _could_ telnet into the Athlon system. With
> > telnet in line mode, it'd probably even be decently responsive.
>
> sdf.org (formerly sdf.lonestar.org) comes to mind...
>
> G'luck,
> Peter
>
> --
> Peter Pentchev  roam at ringlet.net roam at debian.org pp at storpool.com
> PGP key:        http://people.FreeBSD.org/~roam/roam.key.asc
> Key fingerprint 2EE7 A7A5 17FC 124C F115  C354 651E EFB0 2527 DF13

From ken.unix.guy at gmail.com  Sun Mar 26 08:25:36 2023
From: ken.unix.guy at gmail.com (KenUnix)
Date: Sat, 25 Mar 2023 18:25:36 -0400
Subject: [COFF] 3B2/400 Unix System V r3 man
Message-ID: <CAJXSPs_FY27GVAM3YWWYHCo5ciGms82q8oFW5T7stuqJjAq=qQ@mail.gmail.com>

Hi.

Was a man page kit ever made for Unix System V r3?

I am running it under a 3B2/400 sim.

If it is available where could I get it?

Thanks,
Ken


-- 
WWL 📚

From ken.unix.guy at gmail.com  Mon Mar 27 00:00:28 2023
From: ken.unix.guy at gmail.com (KenUnix)
Date: Sun, 26 Mar 2023 10:00:28 -0400
Subject: [COFF] Fortran Question for Unix System-V r3
Message-ID: <CAJXSPs980917_3EWeexoZU--p8u3MeP-DDHRqF0yHL79ng2usA@mail.gmail.com>

Fortran question for Unix System-5 r3.

When executing Fortran programs that require input, the screen stays
blank and no prompts appear. If I enter the input anyway, the program
runs to completion. This is under Unix System V *r3*.

When the same program is compiled under Unix System V *r1* it
works as expected.

Sounds like on Unix System V *r3* the output buffer is not being flushed.
I tried re-compiling F77. No help.

Fortran code follows:
      PROGRAM EASTER
      INTEGER YEAR,METCYC,CENTRY,ERROR1,ERROR2,DAY
      INTEGER EPACT,LUNA
C A PROGRAM TO CALCULATE THE DATE OF EASTER
      PRINT '(A)',' INPUT THE YEAR FOR WHICH EASTER'
      PRINT '(A)',' IS TO BE CALCULATED'
      PRINT '(A)',' ENTER THE WHOLE YEAR, E.G. 1978 '
C LIST-DIRECTED READ, SINCE YEAR IS AN INTEGER
      READ *,YEAR
C CALCULATING THE YEAR IN THE 19 YEAR METONIC CYCLE-METCYC
      METCYC = MOD(YEAR,19)+1
      IF(YEAR.LE.1582)THEN
        DAY = (5*YEAR)/4
        EPACT = MOD(11*METCYC-4,30)+1
      ELSE
C CALCULATING THE CENTURY-CENTRY
        CENTRY = (YEAR/100)+1
C ACCOUNTING FOR ARITHMETIC INACCURACIES
C IGNORES LEAP YEARS ETC.
        ERROR1 = (3*CENTRY/4)-12
        ERROR2 = ((8*CENTRY+5)/25)-5
C LOCATING SUNDAY
        DAY = (5*YEAR/4)-ERROR1-10
C LOCATING THE EPACT(FULL MOON)
        EPACT = MOD(11*METCYC+20+ERROR2-ERROR1,30)
        IF(EPACT.LT.0)EPACT=30+EPACT
        IF((EPACT.EQ.25.AND.METCYC.GT.11).OR.EPACT.EQ.24)THEN
          EPACT=EPACT+1
        ENDIF
      ENDIF
C FINDING THE FULL MOON
      LUNA=44-EPACT
      IF(LUNA.LT.21)THEN
        LUNA=LUNA+30
      ENDIF
C LOCATING EASTER SUNDAY
      LUNA=LUNA+7-(MOD(DAY+LUNA,7))
C LOCATING THE CORRECT MONTH
      IF(LUNA.GT.31)THEN
        LUNA = LUNA - 31
        PRINT '(A,I5)',' FOR THE YEAR ',YEAR
        PRINT '(A,I3)',' EASTER FALLS ON APRIL ',LUNA
      ELSE
        PRINT '(A,I5)',' FOR THE YEAR ',YEAR
        PRINT '(A,I3)',' EASTER FALLS ON MARCH ',LUNA
      ENDIF
      END

Any help would be appreciated,
Ken

-- 
WWL 📚

From paul.winalski at gmail.com  Mon Mar 27 01:04:49 2023
From: paul.winalski at gmail.com (Paul Winalski)
Date: Sun, 26 Mar 2023 11:04:49 -0400
Subject: [COFF] Fortran Question for Unix System-V r3
In-Reply-To: <CAJXSPs980917_3EWeexoZU--p8u3MeP-DDHRqF0yHL79ng2usA@mail.gmail.com>
References: <CAJXSPs980917_3EWeexoZU--p8u3MeP-DDHRqF0yHL79ng2usA@mail.gmail.com>
Message-ID: <CABH=_VQqgayOFjaqBwDhgwPdGW7_pONiUob+X+2qbtF_-0JzPQ@mail.gmail.com>

On 3/26/23, KenUnix <ken.unix.guy at gmail.com> wrote:
> Fortran question for Unix System-5 r3.
>
> When executing Fortran programs that require input, the screen stays
> blank and no prompts appear. If I enter the input anyway, the program
> runs to completion. This is under Unix System V *r3*.
>
> When the same program is compiled under Unix System V *r1* it
> works as expected.
>
> Sounds like on Unix System V *r3* the output buffer is not being flushed.
> I tried re-compiling F77. No help.

Re-compiling F77 doesn't help because the bug is in the Fortran
run-time library (RTL), not in the compiler.  The routine that
implements the READ statement should be flushing the write buffer
before doing the actual read.  Clearly it isn't.
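
As a rough illustration of that rule, here is a minimal C++ sketch. It
is a toy model, not the System V F77 run-time library; TinyRtl and
prompt_seen_before_read are hypothetical names. Output is held in a
buffer, and the read routine flushes that buffer before it blocks for
input, so prompts reach the screen first.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical model of a run-time library with one buffered output
// unit and one input unit that share a terminal.
class TinyRtl {
public:
    TinyRtl(std::istream& in, std::ostream& terminal)
        : in_(in), terminal_(terminal) {}

    // PRINT: append a record to the pending output buffer only.
    void print(const std::string& text) { pending_ += text + "\n"; }

    // Push any buffered output to the terminal.
    void flush() {
        terminal_ << pending_;
        terminal_.flush();
        pending_.clear();
    }

    // READ: flush pending output first so prompts are visible, then read.
    int read_int() {
        flush();
        int value = 0;
        in_ >> value;
        return value;
    }

private:
    std::istream& in_;
    std::ostream& terminal_;
    std::string pending_;
};

// Returns what the simulated screen shows once the read has completed.
std::string prompt_seen_before_read() {
    std::istringstream keyboard("1978\n");
    std::ostringstream screen;
    TinyRtl rtl(keyboard, screen);
    rtl.print(" ENTER THE WHOLE YEAR, E.G. 1978 ");
    rtl.read_int();
    return screen.str();
}
```

Without the flush() call inside read_int(), the simulated screen would
still be empty when the read returns, which matches the blank screen
reported on r3. C++ iostreams get the same effect by tying std::cin to
std::cout by default.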

Their test system probably didn't have very many (if any) tests for
interactive behavior.  That sort of thing is difficult to automate.

-Paul W.

From ken.unix.guy at gmail.com  Tue Mar 28 23:26:12 2023
From: ken.unix.guy at gmail.com (KenUnix)
Date: Tue, 28 Mar 2023 09:26:12 -0400
Subject: [COFF] Unix V r3 question
Message-ID: <CAJXSPs9girPqm-xOk0S+VeaXbetGhp1JMZsMQd0Lw6_6Asr9WQ@mail.gmail.com>

Hi.

Does anyone have the "man" pages for Basic for System-V r3?

Thanks,
Ken

-- 
WWL 📚

From ken.unix.guy at gmail.com  Tue Mar 28 23:31:14 2023
From: ken.unix.guy at gmail.com (KenUnix)
Date: Tue, 28 Mar 2023 09:31:14 -0400
Subject: [COFF] Unix System V r1 -> Unix System V r3
Message-ID: <CAJXSPs8Ahgos4TydX8_0OdtnW-JYn51NOgDvx9U3U2MUqJ_WtQ@mail.gmail.com>

Hi,

Has anyone been successful in communicating using cu or some
other method to transfer files between two sims running Unix System V?

If so I would appreciate some help.

Thanks,
Ken

-- 
WWL 📚

From gingell at computer.org  Thu Mar 30 09:07:57 2023
From: gingell at computer.org (Rob Gingell)
Date: Wed, 29 Mar 2023 16:07:57 -0700
Subject: [COFF] [TUHS] Re: Origins of the frame buffer device
In-Reply-To: <7w7cvr4x36.fsf@junk.nocrew.org>
References: <20230305185202.91B7B18C08D@mercury.lcs.mit.edu>
 <7w7cvr4x36.fsf@junk.nocrew.org>
Message-ID: <d1b8d289-d8ee-b7c7-65b3-1a62621902ec@computer.org>

[Redirected to COFF for some anecdotal E&S-related history and non-UNIX 
terminal room nostalgia.]

On 3/7/23 9:43 PM, Lars Brinkhoff wrote:
> Noel Chiappa wrote:
>>> The first frame buffers from Evans and Sutherland were at University
>>> of Utah, DOD SITES and NYIT CGL as I recall.  Circa 1974 to 1978.
>>
>> Were those on PDP-11's, or PDP-10's? (Really early E+S gear attached to
>> PDP-10's; '74-'78 sounds like an interim period.)
> 
> The Picture System from 1974 was based on a PDP-11/05.  It looks like
> vector graphics rather than a frame buffer though.
> 
> http://archive.computerhistory.org/resources/text/Evans_Sutherland/EvansSutherland.3D.1974.102646288.pdf

E&S LDS-1s used PDP-10s as their host systems. LDS-2s could at least in 
principle use several different hosts (including spanning a range of 
word sizes, e.g., a SEL-840 with 24 bit words or a 16 bit PDP-11.)

The Line Drawing Systems drove calligraphic displays. No frame buffers. 
The early Picture Systems (like the brochure referenced by Lars) also 
drove calligraphic displays but did sport a line segment "refresh 
buffer" so that screen refreshes weren't dependent on the whole 
pipeline's performance.

At least one heavily customized LDS-2 (described further below) produced 
raster output by 1974 (and likely earlier in design and testing) and had 
a buffer for raster refresh which exhibited some of what we think of as 
the functionality of a frame buffer fitting the time frame referenced by 
Noel for other E&S products.

On 3/8/23 10:21 AM, Larry McVoy wrote:
> I really miss terminal rooms.  I learned so much looking over the
> shoulders of more experienced people.

Completely agree. They were the "playground learning" that did all of 
educate, build craft and community, and occasionally bestow humility.

Although it completely predates frame buffer technology, the PDP-10 
terminal room of the research computing environment at CWRU in the 1970s 
was especially remarkable as well as personally influential. All 
(calligraphic) graphics terminals and displays (though later a few 
Datapoint CRTs appeared.) There was an LDS-1 hosted on the PDP-10 and 
later an LDS-2 (which was co-located but not part of the PDP-10 
environment.)

The chair of the department, Edward (Ted) Glaser, had been recruited 
from MIT in 1968 and was heavily influential in guiding the graphics 
orientation of the facilities, and later, in the design of the 
customized LDS-2. Especially remarkable as he had been blind since he 
was 8. He had a comprehensive vision of systems and thinking about them 
that influenced a lot about the department's programs and research.

When I arrived in 1972, I only had a fleeting overlap with the LDS-1 to 
experience some of its games (color wheel lorgnettes and carrier 
landings!). The PDP-10 was being modified for TENEX and the LDS-1 was 
being decommissioned. I recall a tablet and button box for LDS-1 input 
devices.

The room was kept dimly lit with the overhead lighting off and only the 
glow of the displays and small wattage desk lamps. It shared the raised 
floor environment with the PDP-10 machine room (though was walled off 
from it) and so had a "quiet-loud" aura from all the white noise. The 
white noise cocooned you but permitted conversation and interaction with 
others that didn't usually disturb the uninvolved.

The luxury terminals were IMLAC PDS-1s. There was a detachable switch 
and indicator console that could be swapped between them for debugging 
or if you simply liked having the blinking lights in view. When not in 
use for real work the IMLACs would run Space War, much to the detriment 
of IMLAC keyboards. They could handle pretty complex displays, like, a 
screen full of dense text before flicker might set in. Light pens 
provided pointing input.

The bulk of the terminals were an array of DEC VT02s. Storage tube 
displays (so no animation possible), but with joysticks for pointing and 
interacting. There were never many VT02s made and we always believed we 
had the largest single collection of them.

None of these had character generators. The LDS-1 and the IMLACs drew 
their own characters programmatically. A PDP-8/I drove the VT02s and 
stroked all the characters. It did it at about 2400 baud but when the 8 
got busy you could perceive the drawing of the characters like a scribe 
on speed. If you stood way back to take in the room you could also watch 
the PDP-8 going around as the screens brightened momentarily as the 
characters/images were drawn. I was told that CWRU wrote the software 
for the PDP-8 and gave it to DEC, in return DEC gave CWRU $1 and the 
biggest line printer they sold. (The line printer did upper and lower 
case, and the University archivists swooned when presented with theses 
printed on it -- RUNOFF being akin to magic in a typewriter primitive 
world.)

Until the Datapoint terminals arrived all the devices in the room either 
were computers themselves or front-ended by one. Although I only saw it 
happen once, the LDS-1 with its rather intimate connection to the -10 
was particularly attuned to the status of TOPS-10 and would flash 
"CRASH" before users could tell that something was wrong vs. just being 
slow.

(We would later run TOPS-10 for amusement. The system had 128K words in 
total: 4 MA10 16K bays and 1 MD10 64K bay. TENEX needed a minimum of 80K 
to "operate" though it'd be misleading to describe that configuration as 
"running". If we lost the MD10 bay that meant no TENEX so we had a 
DECtape-swapping configuration of TOPS-10 for such moments because, 
well, a PDP-10 with 8 DECtapes twirling is pretty humorously theatrical.)

All the displays (even the later Datapoints) had green or blue-green 
phosphors. This had the side effect that several hours of staring at
them made anything that was white look pink. This was 
especially pronounced in the winter in that being Cleveland it wasn't 
that unusual to leave to find a large deposit of seemingly psychedelic 
snow that hadn't been there when you went in.

The LDS-2 arrived in the winter of 1973-4. It was a highly modified 
LDS-2 that produced raster graphics and shaded images in real-time. It 
was the first system to do that and was called the Case Shaded Graphics 
System (SGS). (E&S called it the Halftone System as it wouldn't do color 
in real-time. In addition to a black & white raster display, it had a 
35mm movie camera, a Polaroid camera, and an RGB filter that would 
triple-expose each frame and so in a small way retained the charm of the 
lorgnettes used on the LDS-1 to make color happen but not in real-time.) 
It was hosted by a PDP-11/40 running RT-11.

Declining memory prices helped enable the innovations in the SGS as it 
incorporated more memory components than the previous calligraphic 
systems. The graphics pipeline was extended such that after translation 
and clipping there was a Y-sort box that ordered the polygons from top 
to bottom for raster scanning followed by a Visible Surface Processor 
that separated hither from yon and finally a Gouraud Shader that 
produced the final image to a monitor or one of the cameras. Physically 
the system was 5 or maybe 6 bays long not including the 11/40 bay.
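
For readers unfamiliar with two of those stages, here is a minimal C++
sketch of the ideas only. It is not a description of the E&S hardware.
The Polygon record, the function names, and the convention that y grows
downward are all assumptions made for illustration. A Y-sort orders
polygons by their topmost scan line, and Gouraud shading linearly
interpolates intensities across a scan-line span.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical polygon record holding only what the sketch needs.
struct Polygon {
    int id;
    float top_y;  // smallest y of any vertex; y is assumed to grow downward
};

// Y-sort: order polygons by their topmost scan line, top of screen first.
std::vector<int> y_sort(std::vector<Polygon> polys) {
    std::stable_sort(polys.begin(), polys.end(),
                     [](const Polygon& a, const Polygon& b) {
                         return a.top_y < b.top_y;
                     });
    std::vector<int> order;
    for (const Polygon& p : polys) order.push_back(p.id);
    return order;
}

// Gouraud shading along one scan-line span: intensities known at the two
// ends are linearly interpolated for each pixel in between.
std::vector<float> shade_span(float left, float right, int pixels) {
    std::vector<float> out;
    for (int x = 0; x < pixels; ++x) {
        float t = (pixels > 1) ? static_cast<float>(x) / (pixels - 1) : 0.0f;
        out.push_back(left + t * (right - left));
    }
    return out;
}
```

Sorting by top edge lets a raster generator pick up each polygon as the
scan reaches it. The visible-surface step that would sit between these
two functions is omitted here.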

The SGS had some teething problems after its delivery. Ivan Sutherland 
even came to Cleveland to work on it though he has claimed his main 
memory of that is the gunfire he heard from the Howard Johnson's hotel 
next to campus. The University was encircled by several distressed 
communities at the time. A "bullet hole through glass" decal appeared on 
the window of the SGS's camera bay to commemorate his experience.

The SGS configuration was unique but a number of its elements were 
incorporated into later Picture Systems. It's my impression that the LDS 
systems were pretty "one off" and the Picture Systems became the 
(relative) "volume, off the shelf" product from E&S. (I'd love to read a 
history of all the things E&S did in that era.)

By 1975-6 the SGS was being used by projects ranging from SST stress 
analyses to mathematicians producing videos of theoretical concepts. The 
exaggerated images of stresses on aircraft structures got pretty widely 
distributed and referenced at the time. The SGS was more of a production 
system used by other departments and entities rather than computer 
graphics research as such; in some ways its (engineering) research 
utility was achieved by its having existed. One student, Ben Jones, 
created an extended ALGOL-60 to allow programming in something other 
than the assembly language.

As the SGS came online in 1975 the PDP-10 was being decommissioned and 
the calligraphic technologies associated with it vanished along with it. 
A couple of years later a couple of Teraks appeared and by the end of 
the 1970s frame buffers as we generally think of them were economically 
practical. That along with other processing improvements rendered the 
SGS obsolete and so it was decommissioned in 1980 and donated to the 
Computer History Museum where I imagine it sits in storage next to a 
LINC-8 or the Ark of the Covenant or something.

One of the SGS's bays (containing the LDS-2 Channel Control, the front 
of the pipeline LDS program interpreter running out of the host's 
memory) and the PDP-11 interface is visible via this link:

https://www.computerhistory.org/collections/catalog/102691213

The bezels on the E&S bays were cosmetically like the DEC ones of the 
same era. They were all smoked glass so the blinking lights were visible 
but had to be raised if you wanted to see the identifying legends for them.