Discussion:
[xquery-talk] backticks in regex - tales of the unexpected part II
Ihe Onwuka
2014-04-07 16:07:21 UTC
backticks match the \w regex class which does seem at odds with the
definition of that class.
Ihe Onwuka
2014-04-07 16:09:57 UTC
to put that another way why is a backtick (matches \w) deemed more
wordy than a quote which doesn't match \w.
David Carlisle
2014-04-07 16:21:31 UTC
Post by Ihe Onwuka
to put that another way why is a backtick (matches \w) deemed more
wordy than a quote which doesn't match \w.
You really cross-posted to the wrong lists: regex syntax is defined
by XML Schema, not by XSLT or XQuery, and XML Schema defines \w as

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of
"punctuation", "separator" and "other" characters)


By backtick I assume you mean U+0060 [`], which isn't a quotation mark:
it's GRAVE ACCENT and has Unicode general category Sk ("Symbol, modifier"),
so it isn't punctuation, separator or other.

David
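
A minimal sketch of that definition in Python, for anyone who wants to check individual characters. It assumes the interpreter's unicodedata tables are a fair stand-in for the Unicode version an XSD regex engine uses; the helper name is_xsd_word_char is hypothetical and not part of any XSD, XSLT or XQuery API. Python's own \w is a different class, so the check works on general categories directly.

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    """Approximate XSD \\w: everything except categories P*, Z* and C*."""
    # The first letter of the general category is the major class
    # (L, M, N, P, S, Z, C).
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")


# Grave accent, ASCII apostrophe, right single quote, letter, underscore, space.
for ch in ("\u0060", "\u0027", "\u2019", "a", "_", " "):
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} category={category} word={is_xsd_word_char(ch)}")
```

Under this reading the grave accent (Sk) counts as a word character, while the apostrophe (Po) and the underscore (Pc) do not.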
Ihe Onwuka
2014-04-07 16:36:29 UTC
Just going by the definition of the \w class in MK's XPath 2.0
reference - \w -> a character considered to form part of a word

So it's TS if backtick isn't a word character in your vocabulary.
Probably neither the first nor the last to get caught by that one.
David Carlisle
2014-04-07 16:49:52 UTC
Post by Ihe Onwuka
Just going by the definition of the \w class in MK's XPath 2.0
reference - \w -> a character considered to form part of a word
It doesn't (and can't) mean a character that _you_ consider part of a word, since
it's not a user (or even locale) dependent expression. So it has to mean
what someone considered (very loosely) to be part of a "word". Including spacing
accents in that list doesn't seem unreasonable.
Post by Ihe Onwuka
So it's TS if backtick isn't a word character in your vocabulary.
No, just that if you are writing a vocabulary-specific regex you need to
use vocabulary-specific regex terms. If I'm looking for words in English
I tend to use [a-z] even if some people try to sneak accents into cafe
or naive :-)

David
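
To make the difference concrete, here is a small Python sketch that splits a title into "words" two ways: once with an approximation of the XSD \w class, and once with an explicit ASCII letter class. The function names are hypothetical, and the XSD approximation assumes Python's unicodedata categories match what an XSD regex engine would use.

```python
import re
import unicodedata
from itertools import groupby


def is_xsd_word_char(ch: str) -> bool:
    # Approximation of XSD \w: any character outside categories P*, Z*, C*.
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")


def xsd_word_runs(text: str) -> list:
    # Group consecutive characters by word/non-word and keep the word runs.
    return ["".join(run) for is_word, run in groupby(text, key=is_xsd_word_char) if is_word]


def ascii_word_runs(text: str) -> list:
    # Vocabulary-specific alternative: only unaccented Latin letters.
    return re.findall(r"[A-Za-z]+", text)


title = "Aisha\u0060s Song"  # the title contains a grave accent, U+0060
print(xsd_word_runs(title))    # ['Aisha`s', 'Song']
print(ascii_word_runs(title))  # ['Aisha', 's', 'Song']
```

The explicit class treats the grave accent as a separator, which is the behaviour the original poster expected from \w.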
Ihe Onwuka
2014-04-07 17:04:11 UTC
Post by David Carlisle
No just that if you are writing vocabulary specific regex you need to
use vocabulary specific regex terms. If I'm looking for words in English
I tend to use [a-z] even if some people try to sneak accents into cafe
or naive :-)
Well mine is not a regional vocabulary scenario. The backtick appears
in a title which is used to create a url which (I believe) will not
tolerate such characters.
David Carlisle
2014-04-07 18:32:33 UTC
Post by Ihe Onwuka
Post by David Carlisle
No just that if you are writing vocabulary specific regex you need
to use vocabulary specific regex terms. If I'm looking for words
in English I tend to use [a-z] even if some people try to sneak
accents into cafe or naive :-)
Well mine is not a regional vocabulary scenario. The backtick
appears in a title which is used to create a url which (I believe)
will not tolerate such characters.
Well, then the grave accent is the least of your concerns with \w.

URI letters are defined as ALPHA (%41-%5A and %61-%7A), i.e. [a-zA-Z], so
URIs don't allow accented letters, Greek, Cyrillic, or the tens of thousands
of other characters included in \w.

https://tools.ietf.org/html/rfc3986

Of course, most user-facing systems such as HTML or XML allow a much
wider set of characters in href attributes and SYSTEM identifiers and
leave it to the system to %-encode according to the somewhat arcane URI
rules; cf. the IRI or LEIRI syntax.

David
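
As a hedged illustration of the URI point, the Python sketch below shows two ways to turn a title into something URI-safe: replace everything outside the RFC 3986 unreserved set (ALPHA, DIGIT, "-", ".", "_", "~") with a hyphen, or percent-encode it. The helper names slug_from_title and percent_encode and the sample title are assumptions for illustration only; urllib.parse.quote is the standard-library function and encodes as UTF-8 by default.

```python
import re
from urllib.parse import quote

# Anything that is NOT an RFC 3986 unreserved character.
NOT_UNRESERVED = re.compile(r"[^A-Za-z0-9._~-]+")


def slug_from_title(title: str) -> str:
    # Collapse each run of disallowed characters into a single hyphen.
    return NOT_UNRESERVED.sub("-", title).strip("-")


def percent_encode(title: str) -> str:
    # safe="" means even "/" is encoded; unreserved characters are left alone.
    return quote(title, safe="")


title = "Aisha\u0060s Song"  # sample title containing a grave accent, U+0060
print(slug_from_title(title))  # Aisha-s-Song
print(percent_encode(title))   # Aisha%60s%20Song
```

Either approach avoids relying on \w, which admits far more than a URI path segment can carry unencoded.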
Michael Kay
2014-04-08 17:20:26 UTC
Post by Ihe Onwuka
backticks match the \w regex class which does seem at odds with the
definition of that class.
You might call it a backtick, and misuse it as a kind of quotation mark, but its proper Unicode name and intended semantics are "grave accent", and the \w category includes all non-spacing diacriticals.

Michael Kay
Saxonica
Ihe Onwuka
2014-04-08 17:28:48 UTC
It, and every other backtick in the dataset I am dealing with, is a
mistyped quotation mark.

Exhibit 1

Aisha`s Song, which is supposed to refer to
http://www.imdb.com/title/tt1950067/
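
If every grave accent in the data really is a mistyped apostrophe, one option is to normalise it before any regex matching. The following is only a sketch under that assumption; the function name normalize_apostrophes is hypothetical.

```python
import unicodedata


def normalize_apostrophes(title: str) -> str:
    # Assumes every U+0060 GRAVE ACCENT in the data is a mistyped U+0027 APOSTROPHE.
    return title.replace("\u0060", "\u0027")


fixed = normalize_apostrophes("Aisha\u0060s Song")
print(fixed)                             # Aisha's Song
print(unicodedata.category(fixed[5]))    # Po: punctuation, so outside XSD \w
```

After this step the character falls into a punctuation category, so a \w-based pattern would treat it as a non-word character.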