Discussion:
[xquery-talk] Necessary whitespace
Benito van der Zander
2015-04-27 08:03:13 UTC
Hi,
I just noticed that in XQuery (unlike in Pascal) 3div 1 is not a valid
expression.

Now I am wondering, which of these are valid expressions:

3!(10---.)
12!(.div 3)
12!(12 div.)
1<<a>2</a>
12 div-3
3!(12 div-.)




Benito
_______________________________________________
***@x-query.com
http://x-query.com/mailman/listinfo/talk
Christian Grün
2015-04-27 08:12:52 UTC
Hi Benito,
Post by Benito van der Zander
3!(10---.)
12!(.div 3)
12!(12 div.)
1<<a>2</a>
12 div-3
3!(12 div-.)
It would take some time to elaborate all the reasons for that (I would
surely need to look it up as well), but "12 div-3" is maybe easy to
explain: div-3 is also a valid name test and, thus, path expression.
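
To illustrate why the hyphen gets absorbed: a name may contain "-" and "." after its first character, so a longest-match scan over div-3 consumes all five characters as one name. The Python sketch below is only a rough model. The pattern is an ASCII-only approximation of NCName, and leading_name is a hypothetical helper, not part of any XQuery processor.

```python
import re

# ASCII-only approximation of the NCName production (an assumption for
# illustration; the real production also admits many non-ASCII characters).
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def leading_name(text):
    """Return the longest name-like prefix of text, or None if there is none."""
    m = NCNAME_ASCII.match(text)
    return m.group() if m else None

print(leading_name("div-3"))   # div-3
print(leading_name("div -3"))  # div
```

With a space before the hyphen, the name ends at "div" and the "-3" is left to be read as a separate operand.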

Cheers,
Christian
Ghislain Fourny
2015-04-27 09:18:28 UTC
Hi,

I agree with Christian on the parses/doesn't parse classification.

My understanding is as follows: 3 and div are non-delimiting terminal
symbols and hence must be separated by whitespace.

This is specified here:

http://www.w3.org/TR/xquery-30/#id-terminal-delimitation

12!(12 div.) doesn't parse because a . after a QName requires
whitespace (. and - are listed as exceptions in the above link). The
same applies to div-.

1<<a>2</a> doesn't parse because << would be recognized as a token.
1<<<a>2</a> parses though.

I hope it helps!

Kind regards,
Ghislain

Michael Kay
2015-04-27 09:53:09 UTC
Agreed.

To confuse matters, though, I see that we still have the problematic statement in A.2 "When tokenizing, the longest possible match that is consistent with the EBNF is used." This to my mind has always suggested the idea that the tokenization is sensitive to the grammatical context. And in some cases it is; you don't want to go looking for QNames or IntegerLiterals when you're in DirElementContent, just because a QName or IntegerLiteral is longer than a Char. However, it could also be read as meaning that given "12 div3", tokenizing "div3" as one token is not consistent with the EBNF (it doesn't lead to a valid parse), so it should be tokenized as two tokens. I don't think that has ever been the intent, and I guess section A.2.2 on delimiting and non-delimiting terminals was added to eliminate this interpretation.



Michael Kay
Saxonica
***@saxonica.com
+44 (0) 118 946 5893
Michael Dyck
2015-04-27 15:44:59 UTC
Post by Michael Kay
Agreed.
To confuse matters, though, I see that we still have the problematic
statement in A.2 "When tokenizing, the longest possible match that is
consistent with the EBNF is used."
In the CR period for XQuery 3.0, we changed that sentence from
"valid in the current context"
to
"consistent with the EBNF"
(See meeting 541.)
Post by Michael Kay
This to my mind has always suggested the idea that the tokenization is
sensitive to the grammatical context. And in some cases it is; you don't
want to go looking for QNames or IntegerLiterals when you're in
DirElementContent, just because a QName or IntegerLiteral is longer than
a Char.
Right.
Post by Michael Kay
However, it could also be read as meaning that given "12 div3",
tokenizing "div3" as one token is not consistent with the EBNF (it
doesn't lead to a valid parse),
Yes, I believe that's how that sentence is supposed to be read. There are no
possible continuations of "12 div3" that conform to the EBNF, but there
*are* continuations of "12 div" that conform to the EBNF. So, when the
tokenizer is positioned just before the 'd', "div" is the longest possible
match (LPM) that is consistent with the EBNF, so the next token is "div".
Post by Michael Kay
so it should be tokenized as two tokens.
Well, that's less clear, but I think it's one valid interpretation.
Post by Michael Kay
I don't think that has ever been the intent, and I guess section A.2.2 on
delimiting and non-delimiting terminals was added to eliminate this
interpretation.
I don't think there's a problem with saying it's tokenized as two tokens.
Just because a text can be tokenized doesn't mean it's free of syntax
errors. And section A.2.2 gives just one of the many requirements that a
sequence of tokens must satisfy in order to be error-free. (Specifically,
"div" and "3" are adjacent non-delimiting terminal symbols, and so must be
separated by Whitespace and/or Comments.)

So, in that view, A.2.2 wasn't added to modify the interpretation of the LPM
rule, it was added to flag some of the cases that the LPM rule "lets through".
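
The two-step reading above (tokenize first, then apply the A.2.2 separation requirement as a separate check) can be sketched with a toy model in Python. This is only an illustration under stated assumptions: the terminal set is a tiny hypothetical subset, names are simplified to exclude "-" and ".", the "consistent with the EBNF" qualifier is not modelled at all, and tokenize and check_separation are names invented for the sketch. It therefore uses the "3div 1" example from the start of the thread rather than "12 div3".

```python
import re

# Toy terminal set (an assumption for illustration, not the XQuery grammar).
# Non-delimiting terminals: integer literals and names/keywords.
# Delimiting terminals: a few punctuation symbols ("<<" is tried before "<").
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<int>[0-9]+)
  | (?P<name>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<punct><<|<|>|!|\(|\)|-|\.)
""", re.VERBOSE)

NON_DELIMITING = {"int", "name"}

def tokenize(text):
    """Return (kind, lexeme, preceded_by_whitespace) triples."""
    tokens, pos, saw_ws = [], 0, False
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup == "ws":
            saw_ws = True
        else:
            tokens.append((m.lastgroup, m.group(), saw_ws))
            saw_ws = False
        pos = m.end()
    return tokens

def check_separation(tokens):
    """Flag adjacent non-delimiting terminals that lack separating whitespace."""
    errors = []
    for (k1, t1, _), (k2, t2, ws) in zip(tokens, tokens[1:]):
        if k1 in NON_DELIMITING and k2 in NON_DELIMITING and not ws:
            errors.append((t1, t2))
    return errors

print(check_separation(tokenize("3div 1")))   # [('3', 'div')]
print(check_separation(tokenize("3 div 1")))  # []
```

In this model "3div 1" tokenizes without complaint, and the missing whitespace is only reported by the later check, which mirrors the idea that the separation rule flags cases the tokenizer "lets through".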

-Michael
Benito van der Zander
2015-04-28 20:33:17 UTC
Hi Michael,
Post by Michael Dyck
I don't think there's a problem with saying it's tokenized as two
tokens. Just because a text can be tokenized doesn't mean it's free of
syntax errors. And section A.2.2 gives just one of the many
requirements that a sequence of tokens must satisfy in order to be
error-free. (Specifically, "div" and "3" are adjacent non-delimiting
terminal symbols, and so must be separated by Whitespace and/or
Comments.)
What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of
a MultiplicativeExpr.

Or in
1<<a>2</a>
as "<" and "<a>2</a>"

"<<" is longer, but not consistent.

Cheers,
Benito
Michael Dyck
2015-04-28 21:17:05 UTC
Post by Benito van der Zander
Hi Michael,
Post by Michael Dyck
I don't think there's a problem with saying it's tokenized as two tokens.
Just because a text can be tokenized doesn't mean it's free of syntax
errors. And section A.2.2 gives just one of the many requirements that a
sequence of tokens must satisfy in order to be error-free. (Specifically,
"div" and "3" are adjacent non-delimiting terminal symbols, and so must be
separated by Whitespace and/or Comments.)
What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of a
MultiplicativeExpr.
As pointed out by Ghislain yesterday, the last paragraph of A.2.2 applies:
if a QName or NCName is followed by a "." or "-", the two tokens must be
separated by whitespace and/or Comments.
Post by Benito van der Zander
Or in
1<<a>2</a>
as "<" and "<a>2</a>"
"<<" is longer, but not consistent.
"<<" is longer than "<", and there are continuations of "1<<" that conform
to the EBNF, so the LPM rule compels the tokenizer to pick "<<", which leads
to raising an error at ">". Ghislain also said this yesterday.

It's unclear what you mean by "consistent". If you mean that having the
tokenizer pick "<<" is not consistent with parsing the string as:
1 < <a>2</a>
then, yes, that's quite true.
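
As a rough illustration of that left-to-right choice, here is a minimal Python sketch. The terminal list is a small hypothetical subset, longest_match is an invented helper, and the "consistent with the EBNF" qualifier is ignored (for "1<<" it makes no difference, since valid continuations exist).

```python
def longest_match(text, pos, terminals):
    """Return the longest terminal that is a prefix of text[pos:], or None."""
    candidates = [t for t in terminals if text.startswith(t, pos)]
    return max(candidates, key=len, default=None)

# Hypothetical subset of symbol terminals, for illustration only.
TERMINALS = ["<", "<<", "<=", ">", ">>", ">="]

print(longest_match("1<<a>2</a>", 1, TERMINALS))    # <<
print(longest_match("1 < <a>2</a>", 2, TERMINALS))  # <
```

Once "<<" has been committed to, nothing in this model goes back to retry "<" followed by "<a>", which is why the unspaced form fails while the spaced form does not.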

-Michael

Michael Dyck
2015-04-28 21:22:01 UTC
Post by Michael Dyck
Post by Benito van der Zander
Hi Michael,
What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of a
MultiplicativeExpr.
if a QName or NCName is followed by a "." or "-", the two tokens must be
separated by whitespace and/or Comments.
Oh, sorry, right, you're saying it's not an NCName. Hm, that might be a spec
bug then.

-Michael



Benito van der Zander
2015-04-28 22:26:57 UTC
Hi Michael,
Post by Michael Dyck
Post by Benito van der Zander
What if it parses it in
12!(12 div.)
as two tokens?
"." is a terminal symbol, and "div" is not a NCName there, just part of a
MultiplicativeExpr.
As pointed out by Ghislain yesterday, the last paragraph of A.2.2 applies:
if a QName or NCName is followed by a "." or "-", the two tokens must be
separated by whitespace and/or Comments.
Oh, sorry, right, you're saying it's not an NCName. Hm, that might
be a spec bug then.
Yes
Post by Benito van der Zander
Post by Michael Dyck
Or in
1<<a>2</a>
as "<" and "<a>2</a>"
"<<" is longer, but not consistent.
"<<" is longer than "<", and there are continuations of "1<<" that
conform to the EBNF, so the LPM rule compels the tokenizer to pick
"<<", which leads to raising an error at ">". Ghislain also said this
yesterday.
It's unclear what you mean by "consistent". If you mean that having
Perhaps getting a consistent parsing tree?

Theoretically a parser could parse it right-to-left and see <a>2</a>
before <


Cheers,
Benito
Michael Dyck
2015-04-28 22:51:51 UTC
Post by Benito van der Zander
Post by Michael Dyck
Post by Benito van der Zander
Or in
1<<a>2</a>
as "<" and "<a>2</a>"
"<<" is longer, but not consistent.
"<<" is longer than "<", and there are continuations of "1<<" that
conform to the EBNF, so the LPM rule compels the tokenizer to pick "<<",
which leads to raising an error at ">". Ghislain also said this yesterday.
It's unclear what you mean by "consistent". If you mean that having the
Perhaps getting a consistent parsing tree?
Well, again, if you're saying that having the tokenizer pick "<<" does not
result in a "consistent parsing tree", that's quite true, because it doesn't
result in *any* parse tree.
Post by Benito van der Zander
Theoretically a parser could parse it right-to-left and see <a>2</a> before <
Theoretically, yes. But the thinking behind the LPM rule was presumably a
left-to-right tokenization.

-Michael


Leo Studer
2015-04-27 09:49:08 UTC
Hello

I use Oxygen with Saxon enterprise edition 9.6.05.
My output method is “text”.

The following statement does its job correctly

for $c in doc("factbook.xml")//country order by $c/@name return $c/@name/string()

However I do not understand why

for $c in doc("factbook.xml")//country order by $c/@name return $c/@name

returns an empty sequence.
What especially intrigues me is that ordering with $c/@name works fine (no conversion to string), but the output as text does not.

Any hints?

Thanks in advance
Leo
Michael Kay
2015-04-27 10:30:40 UTC
Permalink
At first sight I would have expected an error. It appears to fall foul of rule 7 in

http://www.w3.org/TR/xslt-xquery-serialization-31/#serdm

It is a serialization error [err:SENR0001] if an item in S6 is an attribute node, a namespace node or a function.

And indeed, Saxon reports:

SENR0001: Cannot serialize a free-standing attribute node

So I think Oxygen is suppressing this error somehow.

The question then becomes, why does the spec do that? I think the answer is that XQuery picked up the serialization spec from XSLT, and XSLT never generates free-standing attribute nodes in its result, so the problem didn't arise there.

XQ 3.1 introduces the serialization method "adaptive", which is designed to display something, without failure, regardless of what you throw at it. For attributes, it shows

name="value"

not just the value, which is what you appear to want.

Michael Kay
Saxonica
***@saxonica.com
+44 (0) 118 946 5893