Discussion:
[xquery-talk] An analyze-string stumper
Joe Wicentowski
2018-04-23 16:22:40 UTC
Hi all,

I have encountered an unexpected challenge constructing a regex for a
pattern I am looking for. I am looking for numbers in parentheses. For
example, in the following string:

"On February 13, 1968, Secretary of State Dean Rusk sent a
message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171)"

... I would like to match "79" and "171" (but not "UAR" or "13" or
"1968"). I have been trying to construct a regex for use with
analyze-string to capture this pattern, but I have not been successful. I
have tried the following:

analyze-string($string, "(?:\()(?:(\d+)(?:, )?)+(?:\))")

In other words, there are these 3 components:

1. (?:\() a non-capturing group consisting of an open parens, followed by
2. (?:(\d+)(?:, )?)+ one or more non-capturing groups consisting of (a
number followed by an optional, non-matching comma-and-space), followed by
3. (?:\)) a non-capturing group consisting of a close parens

I was expecting to get the following output:

<fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
<fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </fn:non-match>
<fn:match>(<fn:group nr="1">79</fn:group>,
<fn:group nr="1">171</fn:group>)</fn:match>
</fn:analyze-string-result>

However, the actual result is that the first number ("79") is skipped, and
only the 2nd number ("171") is captured:

<fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
<fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a
message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </fn:non-match>
<fn:match>(79,
<fn:group nr="1">171</fn:group>)</fn:match>
</fn:analyze-string-result>

What am I missing? Can anyone suggest a regex that is able to capture both
numbers inside the parentheses? Or do I need to make a two-pass run
through this, finding parenthetical text with a first analyze-string like
"\(.+\)" and then looking inside its matches with a second analyze-string
like "(\d+)(?:, )?"?

Thanks,
Joe
Patrick Durusau
2018-04-23 18:50:42 UTC
Joe,

Forgive the length but I'm likely to bump my head on this issue in the
future, so a fuller than necessary explanation:

Started with the simplest regex that would capture the parens:

1. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(.*\)")

1. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic </fn:non-match>
  <fn:match>(UAR) President Gamal Abdel Nasser, urging him to seize the
unique opportunity offered by the Jarring mission to achieve peace. (79,
171)</fn:match>
  <fn:non-match> </fn:non-match>
</fn:analyze-string-result>

OK, so what do we know about the desired matches? Digits plus (, ) with
no spaces. Yes?

2. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+, \d+\)")

So I match parens plus digits, ", " (comma plus whitespace), digits plus
paren.

2. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </fn:non-match>
  <fn:match>(79, 171)</fn:match>
  <fn:non-match> </fn:non-match>
</fn:analyze-string-result>

I need to split the two numbers, and what better way to do that than
alternative matching?

3. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+ | \d+\)")

3. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79,</fn:non-match>
  <fn:match> 171)</fn:match>
  <fn:non-match> </fn:non-match>
</fn:analyze-string-result>

You're probably already laughing because you see my mistake, which I
correct in #4:

4. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+|\d+\)")

4. Result: <fn:analyze-string-result
xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </fn:non-match>
  <fn:match>(79</fn:match>
  <fn:non-match>, </fn:non-match>
  <fn:match>171)</fn:match>
  <fn:non-match> </fn:non-match>
</fn:analyze-string-result>

The error was here: "\(\d+ | \d+\)", whose first alternative only matches
an open paren plus digits plus a white space, whereas the number in question
was followed by *no space* and a comma.

Know thy data!

Examples created on BaseX. BTW, I started from known good examples in
XQuery Functions 3.1, verified that they worked and then created the
search strings.

Hope this helps!

Patrick
_______________________________________________
http://x-query.com/mailman/listinfo/talk
--
Patrick Durusau
***@durusau.net
Technical Advisory Board, OASIS (TAB)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau
Joe Wicentowski
2018-04-23 19:50:52 UTC
Hi Patrick,

Thanks for your reply! That 4th version is certainly promising, but I
wonder, will it capture a case I have but regrettably didn't mention
explicitly: more than 2 numbers? Here's an example:

The most significant elements in the package were 18 F-104 fighters and
100 M 48 tanks. (72, 76, 77, 82, 89, 95, 99, 107, 111, 125)

Here, I've got more than 2 numbers inside the parentheses, so I can't count
on a paren to begin or end each number. I was hoping to find a pattern that
would wrap each of the numbers inside the parentheses in an <fn:group>
element, without risking inadvertent hits on numbers outside the
parentheses.

I'd take any solution or hint, but what really threw me about my attempts
was that I wasn't able to use the open and close parentheses to anchor my
search and allow arbitrary repeats of number-plus-optional-comma-and-space
"(\d+(, )?)+" within a pair of parentheses. I couldn't see why this wasn't
capturing each of the numbers within the parentheses.

Joe
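
One way to check the question above is to run the pattern from the fourth version against the longer example. This is only a sketch, assuming an XQuery 3.0 or later processor (the earlier examples were created on BaseX); the variable name $s is illustrative and not from the thread.

```xquery
xquery version "3.1";

(: $s is an illustrative name; the text is the sentence quoted above. :)
let $s := "The most significant elements in the package were 18 F-104 fighters and " ||
          "100 M 48 tanks. (72, 76, 77, 82, 89, 95, 99, 107, 111, 125)"
(: Pattern from version 4: "(" plus digits, OR digits plus ")". :)
return analyze-string($s, "\(\d+|\d+\)")/fn:match/string()
```

Only the first and last numbers should come back, as "(72" and "125)". The eight numbers in between are neither preceded by an open paren nor followed by a close paren, so neither alternative can match them.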
Michael Kay
2018-04-24 07:41:40 UTC
It would have been really nice to define fn:analyze-string() so that it would capture multiple matches of the capturing groups, but we made a policy decision that as far as possible it should be possible to implement the XPath/XQuery regex facilities using existing regex libraries, and sadly they generally do not have this capability.
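
A small sketch of the behavior described above, assuming an XQuery 3.0 or later processor. It reuses the pattern from the original question on a shortened, made-up input string with three numbers.

```xquery
xquery version "3.1";

(: Pattern from the original question; the input string is a shortened example. :)
let $result := analyze-string("(72, 76, 77)", "(?:\()(?:(\d+)(?:, )?)+(?:\))")
(: List the text of every fn:group element in the result. :)
return $result//fn:group/string()
```

Going by the result reported at the top of the thread, one would expect a single fn:group here, holding only the number from the last repetition ("77"), even though the whole parenthetical is one fn:match.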

But in any case I think a multi-pass approach is probably appropriate here. In fact generally, I think the approach of trying to do everything in one great complex regular expression is usually misguided. Splitting it up into smaller steps not only makes the logic easier to understand (and therefore to debug and maintain), it can also benefit performance.

So:

(a) use analyze-string to mark any substring comprising "(" followed by digits, spaces, and commas, followed by ")"

(b) use tokenize to split out the individual numbers.

Michael Kay
Saxonica
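
Steps (a) and (b) above might be sketched as follows. This is a minimal illustration rather than a definitive implementation: the input string is assembled from the two examples in this thread, and the element names refs and ref, the variable names, and the exact patterns are all assumptions.

```xquery
xquery version "3.1";

(: Illustrative input assembled from the two examples in this thread. :)
let $text :=
  "a letter to United Arab Republic (UAR) President Gamal Abdel Nasser. (79, 171) " ||
  "18 F-104 fighters and 100 M 48 tanks. (72, 76, 77, 82, 89, 95, 99, 107, 111, 125)"

(: Step (a): "(" followed by digits, spaces and commas, followed by ")".
   Group 1 holds the list without the surrounding parentheses. :)
for $list in analyze-string($text, "\(([\d, ]+)\)")/fn:match/fn:group[@nr = 1]

(: Step (b): split the list on a comma plus optional whitespace. :)
return
  <refs>{
    for $n in tokenize($list, ",\s*")
    return <ref>{ $n }</ref>
  }</refs>
```

This should produce one refs element per parenthetical list, the first with two ref children and the second with ten. "(UAR)" is skipped because it contains characters other than digits, spaces, and commas, and the numbers outside parentheses are never touched.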