Discussion:
[xquery-talk] outer join between 2 sequences
Ihe Onwuka
2014-09-28 10:57:53 UTC
Permalink
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence A
and I want the outer join where sequence B is "null".

Should one expect bad performance in sequence B is not sorted and can one
expect reasonable performance if it is?

I am guessing that performance will vary by implemetation.
Adam Retter
2014-09-28 11:32:42 UTC
Permalink
Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence A and
I want the outer join where sequence B is "null".
There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -

$a[. = $b]
Post by Ihe Onwuka
Should one expect bad performance in sequence B is not sorted and can one
expect reasonable performance if it is?
I think that question is very implementation specific. If all of your
data is in RAM, as your dataset is relatively small and these are just
numbers, I would expect performance to be excellent. I am not sure
that sorting will make much of a difference, but it depends on the
implementation and how it initiates the search for a false comparison
in a large sequence.
Post by Ihe Onwuka
I am guessing that performance will vary by implemetation.
Of course!
Post by Ihe Onwuka
_______________________________________________
http://x-query.com/mailman/listinfo/talk
--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
Ihe Onwuka
2014-09-28 11:42:43 UTC
Permalink
Post by Adam Retter
Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence A
and
Post by Ihe Onwuka
I want the outer join where sequence B is "null".
There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -
$a[. = $b]
I meant null in the SQL sense of outer join - sorry.

I want to drop the things that are in B from A where both B and A are just
sequences of integers. In other words, don't fetch what I've already got.
Post by Adam Retter
Post by Ihe Onwuka
Should one expect bad performance in sequence B is not sorted and can one
expect reasonable performance if it is?
I think that question is very implementation specific. If all of your
data is in RAM, as your dataset is relatively small and these are just
numbers, I would expect performance to be excellent.
I am not sure
Post by Adam Retter
that sorting will make much of a difference, but it depends on the
implementation and how it initiates the search for a false comparison
in a large sequence.
that's good - the nightmare scenario is a O^n2 algorithm if B is not sorted
Ihe Onwuka
2014-09-28 11:55:00 UTC
Permalink
PPS
Post by Adam Retter
Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence
A and
Post by Ihe Onwuka
I want the outer join where sequence B is "null".
There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -
$a[. = $b]
Of course that would be $a[not(. = $b)] maybe even $a except $b, I was
asking more on what the performance expectation would be.

Pleased that it reasonable to expect it to be fast.
Ihe Onwuka
2014-09-28 11:56:22 UTC
Permalink
Post by Ihe Onwuka
Post by Adam Retter
Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence
A and
Post by Ihe Onwuka
I want the outer join where sequence B is "null".
There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -
$a[. = $b]
I meant null in the SQL sense of outer join - sorry.
I want to drop the things that are in B from A where both B and A are just
sequences of integers. In other words, don't fetch what I've already got.
Pah... incorrect snippage in the last post.


Of course that would be $a[not(. = $b)] maybe even $a except $b, I was
asking more on what the performance expectation would be.

Pleased that it reasonable to expect it to be fast.
Adam Retter
2014-09-28 11:57:19 UTC
Permalink
Post by Ihe Onwuka
I want to drop the things that are in B from A where both B and A are just
sequences of integers. In other words, don't fetch what I've already got.
(1,2,3)[fn:not(. = (2,4))]
--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
Adam Retter
2014-09-28 12:29:26 UTC
Permalink
Post by Adam Retter
I think that question is very implementation specific. If all of your
data is in RAM, as your dataset is relatively small and these are just
numbers, I would expect performance to be excellent. I am not sure
that sorting will make much of a difference, but it depends on the
implementation and how it initiates the search for a false comparison
in a large sequence.
I retract any previous statements that may have alluded to good
performance. A quick test on Saxon and eXist shows that this is a very
slow problem. I am trying to see if there is not a short-cut that
could be taken, I will come back to you...
--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
Adam Retter
2014-09-28 13:13:28 UTC
Permalink
So after a bit more coffee and a bit of research it seems to me that
the only way you are going to get this to be fast would be if you used
a hash based looked for one of your sequences. Something like a
HashMap or BloomFilter would do the job, see:
http://stackoverflow.com/questions/4261619/fastest-set-operations-in-the-west

I have made some experimentations with the above in Scala, a HashMap
works nicely here in-terms of performance because it is simple and
available, and in your case with Integers you do not need to worry
about duplicates because the value of a number is it's identity.

So, if you don't want to implement a HashMap or BloomFilter in XQuery,
what is one to do? Well XQuery 3.1 will introduce the Map data-type
and some implementations already have done this. If you understand the
implementation (or even hazard a guess), there is a good chance that
that XQuery map may in-fact under the covers be a HashMap.

I won't say too much more about this as I have been discussing it with
Wolfgang this morning, and I think he will shortly post you a very
fast example when using XQuery 3.1 Maps in eXist...
Post by Adam Retter
Post by Adam Retter
I think that question is very implementation specific. If all of your
data is in RAM, as your dataset is relatively small and these are just
numbers, I would expect performance to be excellent. I am not sure
that sorting will make much of a difference, but it depends on the
implementation and how it initiates the search for a false comparison
in a large sequence.
I retract any previous statements that may have alluded to good
performance. A quick test on Saxon and eXist shows that this is a very
slow problem. I am trying to see if there is not a short-cut that
could be taken, I will come back to you...
--
Adam Retter
skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk
w***@exist-db.org
2014-09-28 13:25:11 UTC
Permalink
Post by Adam Retter
I won't say too much more about this as I have been discussing it with
Wolfgang this morning, and I think he will shortly post you a very
fast example when using XQuery 3.1 Maps in eXist…
Well, Michael already posted something along those lines, but here’s my solution based on the public 3.1 draft:

xquery version "3.0";

let $a := 1 to 1000000
let $b := map:new((1 to 100) ! map:entry(xs:int(util:random() * 100000), 0))
return
$a[map:contains($b, .)]

Wolfgang
Ihe Onwuka
2014-09-28 15:52:19 UTC
Permalink
Post by Adam Retter
I won't say too much more about this as I have been discussing it with
Wolfgang this morning, and I think he will shortly post you a very
fast example when using XQuery 3.1 Maps in eXist

Well, Michael already posted something along those lines, but here’s my
xquery version "3.0";
let $a := 1 to 1000000
let $b := map:new((1 to 100) ! map:entry(xs:int(util:random() * 100000), 0))
return
$a[map:contains($b, .)]
Wolfgang
Would this not be $a[not(map:contains($b, .))] since I want the ones for
which there is no matching sequence B.

Ihe Onwuka
2014-09-28 15:51:04 UTC
Permalink
Post by Adam Retter
So after a bit more coffee and a bit of research it seems to me that
the only way you are going to get this to be fast would be if you used
a hash based looked for one of your sequences. Something like a
http://stackoverflow.com/questions/4261619/fastest-set-operations-in-the-west
Hmmmmm well I have already committed myself before your retraction and the
thing is running now so am just going to leave it.

This is only the tip of the problem because each integer that survives
selection generates an HTTP request that either returns a 404 or leads to
an HTTP Put into eXist. So if I was really in a hurry I'd have to start
looking at mapReducing the keys. I'm not even sure that is the answer as
all the mapreduce jobs would be pounding the same server (mind you the
site may load balance to mitigate that).

Another alternative solution is to just export sequence B into SQLite,
index both sequences (now they are tables) and do it in SQL.
Michael Kay
2014-09-28 13:10:56 UTC
Permalink
Note that maps give you the opportunity to hand-optimize things like this if your product's optimizer doesn't find a way to do it automatically.

let $A := a set of numbers
let $B := another set of numbers
let $map := fold-left($B, map{}, map:put(?, ?, true()))
return $A[empty($map(.))]

Michael Kay
Saxonica
***@saxonica.com
+44 (0) 118 946 5893
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B consisting of about 100k random I numbers within the range of sequence A and I want the outer join where sequence B is "null".
Should one expect bad performance in sequence B is not sorted and can one expect reasonable performance if it is?
I am guessing that performance will vary by implemetation.
_______________________________________________
http://x-query.com/mailman/listinfo/talk
Continue reading on narkive:
Loading...