[xquery-talk] Necessary whitespace

Hi Benito,

Post by Benito van der Zander
3!(10---.)
12!(.div 3)
12!(12 div.)
1<<a>2</a>
12 div-3
3!(12 div-.)

It would take some time to elaborate all the reasons for that (I would
surely need to look it up as well), but "12 div-3" is maybe easy to
explain: div-3 is also a valid name test and, thus, path expression.

Cheers,
Christian
_______________________________________________
***@x-query.com
http://x-query.com/mailman/listinfo/talk

Ghislain Fourny

2015-04-27 09:18:28 UTC

Hi,

I agree with Christian on the parses/doesn't parse classification.

My understanding is as follows: 3 and div are non-delimiting terminal
symbols, and hence must be separated by whitespace.

This is specified here:

http://www.w3.org/TR/xquery-30/#id-terminal-delimitation

12!(12 div.) doesn't parse because a . directly after a QName requires
separating whitespace (. and - are listed as special cases in the section
linked above). The same applies to div-.

1<<a>2</a> doesn't parse because << would be recognized as a token.
1<<<a>2</a> parses though.

I hope it helps!

Kind regards,
Ghislain

On Mon, Apr 27, 2015 at 10:12 AM, Christian Grün

Post by Christian Grün
Hi Benito,

Post by Benito van der Zander
3!(10---.)
12!(.div 3)
12!(12 div.)
1<<a>2</a>
12 div-3
3!(12 div-.)

It would take some time to elaborate all the reasons for that (I would
surely need to look it up as well), but "12 div-3" is maybe easy to
explain: div-3 is also a valid name test and, thus, path expression.
Cheers,
Christian

Michael Kay

2015-04-27 09:53:09 UTC

Agreed.

To confuse matters, though, I see that we still have the problematic statement in A.2 "When tokenizing, the longest possible match that is consistent with the EBNF is used." This to my mind has always suggested the idea that the tokenization is sensitive to the grammatical context. And in some cases it is; you don't want to go looking for QNames or IntegerLiterals when you're in DirElementContent, just because a QName or IntegerLiteral is longer than a Char. However, it could also be read as meaning that given "12 div3", tokenizing "div3" as one token is not consistent with the EBNF (it doesn't lead to a valid parse), so it should be tokenized as two tokens. I don't think that has ever been the intent, and I guess section A.2.2 on delimiting and non-delimiting terminals was added to eliminate this interpretation.

Michael Kay
Saxonica
***@saxonica.com
+44 (0) 118 946 5893

Michael Dyck

2015-04-27 15:44:59 UTC

Post by Michael Kay
Agreed.
To confuse matters, though, I see that we still have the problematic
statement in A.2 "When tokenizing, the longest possible match that is
consistent with the EBNF is used."

In the CR period for XQuery 3.0, we changed that sentence from
"valid in the current context"
to
"consistent with the EBNF"
(See meeting 541.)

Post by Michael Kay
This to my mind has always suggested the idea that the tokenization is
sensitive to the grammatical context. And in some cases it is; you don't
want to go looking for QNames or IntegerLiterals when you're in
DirElementContent, just because a QName or IntegerLiteral is longer than
a Char.

Right.

Post by Michael Kay
However, it could also be read as meaning that given "12 div3",
tokenizing "div3" as one token is not consistent with the EBNF (it
doesn't lead to a valid parse),

Yes, I believe that's how that sentence is supposed to be read. There are no
possible continuations of "12 div3" that conform to the EBNF, but there
*are* continuations of "12 div" that conform to the EBNF. So, when the
tokenizer is positioned just before the 'd', "div" is the longest possible
match (LPM) that is consistent with the EBNF, so the next token is "div".
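
A rough Python sketch of that reading, not the spec's algorithm: it assumes the tokenizer already knows it is in operator position (just after an operand), so only operator terminals are candidates. The OPERATORS list and the longest_operator helper are hypothetical and cover only a tiny subset of XQuery.

```python
# Hypothetical, tiny subset of the terminals that may follow an operand.
OPERATORS = ["<<", "<=", "<", "div", "!"]

def longest_operator(text, pos):
    """Return the longest candidate terminal starting at text[pos], or None."""
    matches = [op for op in OPERATORS if text.startswith(op, pos)]
    return max(matches, key=len) if matches else None

# "12 div3": the 'd' is at offset 3; "div" is chosen, leaving "3" as the next token.
print(longest_operator("12 div3", 3))
# "1<<a>2</a>": at offset 1 both "<" and "<<" match; the longer one is chosen.
print(longest_operator("1<<a>2</a>", 1))
```

Restricting the candidates to what the grammar allows at that point is what keeps "div3" from being read as a single name here. Whether the resulting token sequence is error-free is a separate question.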

Post by Michael Kay
so it should be tokenized as two tokens.

Well, that's less clear, but I think it's one valid interpretation.

Post by Michael Kay
I don't think that has ever been the intent, and I guess section A.2.2 on
delimiting and non-delimiting terminals was added to eliminate this
interpretation.

I don't think there's a problem with saying it's tokenized as two tokens.
Just because a text can be tokenized doesn't mean it's free of syntax
errors. And section A.2.2 gives just one of the many requirements that a
sequence of tokens must satisfy in order to be error-free. (Specifically,
"div" and "3" are adjacent non-delimiting terminal symbols, and so must be
separated by Whitespace and/or Comments.)

So, in that view, A.2.2 wasn't added to modify the interpretation of the LPM
rule, it was added to flag some of the cases that the LPM rule "lets through".
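
The A.2.2 requirement described above can be sketched as a pass over already-tokenized input. This is an illustration only: the regular expression approximating "non-delimiting terminal" (names, keywords, numeric literals) and the (text, offset) token format are assumptions, not taken from the spec.

```python
import re

# Assumption: names/keywords and numeric literals stand in for the spec's
# full list of non-delimiting terminals.
NON_DELIMITING = re.compile(r"[A-Za-z_][\w.-]*|\d+(?:\.\d*)?")

def is_non_delimiting(token):
    return NON_DELIMITING.fullmatch(token) is not None

def adjacency_errors(tokens):
    """tokens: list of (text, start_offset) pairs in source order.
    Return the adjacent non-delimiting pairs with nothing between them."""
    errors = []
    for (left, left_start), (right, right_start) in zip(tokens, tokens[1:]):
        touching = left_start + len(left) == right_start
        if touching and is_non_delimiting(left) and is_non_delimiting(right):
            errors.append((left, right))
    return errors

# "12 div3" tokenized as 12, div, 3: "div" and "3" touch, so the pair is flagged.
print(adjacency_errors([("12", 0), ("div", 3), ("3", 6)]))
# "12 div 3": whitespace separates every pair, so nothing is flagged.
print(adjacency_errors([("12", 0), ("div", 3), ("3", 7)]))
```

The check runs after tokenization, which matches the view that the delimitation rule is an extra constraint on the token sequence and does not change how tokens are chosen.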

-Michael

Benito van der Zander

2015-04-28 20:33:17 UTC

Hi Michael,

Post by Michael Dyck
I don't think there's a problem with saying it's tokenized as two
tokens. Just because a text can be tokenized doesn't mean it's free of
syntax errors. And section A.2.2 gives just one of the many
requirements that a sequence of tokens must satisfy in order to be
error-free. (Specifically, "div" and "3" are adjacent non-delimiting
terminal symbols, and so must be separated by Whitespace and/or
Comments.)

What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of
a MultiplicativeExpr.

Or in
1<<a>2</a>
as "<" and "<a>2</a>"

"<<" is longer, but not consistent.

Cheers,
Benito

Michael Dyck

2015-04-28 21:17:05 UTC

Post by Benito van der Zander
Hi Michael,

Post by Michael Dyck
I don't think there's a problem with saying it's tokenized as two tokens.
Just because a text can be tokenized doesn't mean it's free of syntax
errors. And section A.2.2 gives just one of the many requirements that a
sequence of tokens must satisfy in order to be error-free. (Specifically,
"div" and "3" are adjacent non-delimiting terminal symbols, and so must be
separated by Whitespace and/or Comments.)

What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of a
MultiplicativeExpr.

As pointed out by Ghislain yesterday, the last paragraph of A.2.2 applies:
if a QName or NCName is followed by a "." or "-", the two tokens must be
separated by whitespace and/or Comments.

Post by Benito van der Zander
Or in
1<<a>2</a>
as "<" and "<a>2</a>"
"<<" is longer, but not consistent.

"<<" is longer than "<", and there are continuations of "1<<" that conform
to the EBNF, so the LPM rule compels the tokenizer to pick "<<", which leads
to raising an error at ">". Ghislain also said this yesterday.

It's unclear what you mean by "consistent". If you mean that having the
tokenizer pick "<<" is not consistent with parsing the string as:
1 < <a>2</a>
then, yes, that's quite true.

-Michael

Michael Dyck

2015-04-28 21:22:01 UTC

Post by Benito van der Zander
Hi Michael,
What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of a
MultiplicativeExpr.

Post by Michael Dyck
if a QName or NCName is followed by a "." or "-", the two tokens must be
separated by whitespace and/or Comments.

Oh, sorry, right, you're saying it's not an NCName. Hm, that might be a spec
bug then.

-Michael

Benito van der Zander

2015-04-28 22:26:57 UTC

Hi Michael,

Post by Benito van der Zander
Hi Michael,
What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of a
MultiplicativeExpr.

Post by Michael Dyck
As pointed out by Ghislain yesterday, the last paragraph of A.2.2 applies:
if a QName or NCName is followed by a "." or "-", the two tokens must be
separated by whitespace and/or Comments.

Oh, sorry, right, you're saying it's not an NCName. Hm, that might be a spec
bug then.

Yes

Post by Michael Dyck
Or in
1<<a>2</a>
as "<" and "<a>2</a>"
"<<" is longer, but not consistent.

Perhaps getting a consistent parsing tree?

Theoretically a parser could parse it right-to-left and see <a>2</a>
before <

Cheers,
Benito

Michael Dyck

2015-04-28 22:51:51 UTC

Post by Benito van der Zander
Or in
1<<a>2</a>
as "<" and "<a>2</a>"
"<<" is longer, but not consistent.

Perhaps getting a consistent parsing tree?

Well, again, if you're saying that having the tokenizer pick "<<" does not
result in a "consistent parsing tree", that's quite true, because it doesn't
result in *any* parse tree.

Post by Benito van der Zander
Theoretically a parser could parse it right-to-left and see <a>2</a> before <

Theoretically, yes. But the thinking behind the LPM rule was presumably a
left-to-right tokenization.

-Michael

Leo Studer

2015-04-27 09:49:08 UTC

Hello

I use Oxygen with Saxon enterprise edition 9.6.05.
My output method is “text”.

The following statement does its job correctly

for $c in doc("factbook.xml")//country order by $c/@name return $c/@name/string()

However I do not understand why

for $c in doc("factbook.xml")//country order by $c/@name return $c/@name

returns an empty sequence.
What especially intrigues me is that ordering by $c/@name works fine (with no conversion to string), but outputting it as text does not.

Any hints?

Thanks in advance
Leo

Michael Kay

2015-04-27 10:30:40 UTC