[xquery-talk] outer join between 2 sequences

Discussion:

Ihe Onwuka

2014-09-28 10:57:53 UTC

I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence A
and I want the outer join where sequence B is "null".

Should one expect bad performance in sequence B is not sorted and can one
expect reasonable performance if it is?

I am guessing that performance will vary by implemetation.

Adam Retter

2014-09-28 11:32:42 UTC

Permalink

Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence A and
I want the outer join where sequence B is "null".

There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -

$a[. = $b]

Post by Ihe Onwuka
Should one expect bad performance in sequence B is not sorted and can one
expect reasonable performance if it is?

I think that question is very implementation specific. If all of your
data is in RAM, as your dataset is relatively small and these are just
numbers, I would expect performance to be excellent. I am not sure
that sorting will make much of a difference, but it depends on the
implementation and how it initiates the search for a false comparison
in a large sequence.

Post by Ihe Onwuka
I am guessing that performance will vary by implemetation.

Of course!

Post by Ihe Onwuka
_______________________________________________
http://x-query.com/mailman/listinfo/talk

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Ihe Onwuka

2014-09-28 11:42:43 UTC

Permalink

Post by Adam Retter

Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence A

and

Post by Ihe Onwuka
I want the outer join where sequence B is "null".

There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -
$a[. = $b]

I meant null in the SQL sense of outer join - sorry.

I want to drop the things that are in B from A where both B and A are just
sequences of integers. In other words, don't fetch what I've already got.

Post by Adam Retter

Post by Ihe Onwuka
Should one expect bad performance in sequence B is not sorted and can one
expect reasonable performance if it is?

I think that question is very implementation specific. If all of your
data is in RAM, as your dataset is relatively small and these are just
numbers, I would expect performance to be excellent.

I am not sure

Post by Adam Retter
that sorting will make much of a difference, but it depends on the
implementation and how it initiates the search for a false comparison
in a large sequence.

that's good - the nightmare scenario is a O^n2 algorithm if B is not sorted

Ihe Onwuka

2014-09-28 11:55:00 UTC

Permalink

PPS

Post by Adam Retter

Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence

A and

Post by Ihe Onwuka
I want the outer join where sequence B is "null".

There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -
$a[. = $b]

Of course that would be $a[not(. = $b)] maybe even $a except $b, I was
asking more on what the performance expectation would be.

Pleased that it reasonable to expect it to be fast.

Ihe Onwuka

2014-09-28 11:56:22 UTC

Permalink

Post by Ihe Onwuka

Post by Adam Retter

Post by Ihe Onwuka
I have sequence A consisting of the numbers 0 to 4000000 and sequnce B
consisting of about 100k random I numbers within the range of sequence

A and

Post by Ihe Onwuka
I want the outer join where sequence B is "null".

There is no 'null' in XQuery, so I am not quite sure what you mean
here. If you ware looking for all values that appear in sequence A and
sequence B, then you can do the following -
$a[. = $b]

I meant null in the SQL sense of outer join - sorry.
I want to drop the things that are in B from A where both B and A are just
sequences of integers. In other words, don't fetch what I've already got.

Pah... incorrect snippage in the last post.

Of course that would be $a[not(. = $b)] maybe even $a except $b, I was
asking more on what the performance expectation would be.

Pleased that it reasonable to expect it to be fast.

Adam Retter

2014-09-28 11:57:19 UTC

Permalink

Post by Ihe Onwuka
I want to drop the things that are in B from A where both B and A are just
sequences of integers. In other words, don't fetch what I've already got.

(1,2,3)[fn:not(. = (2,4))]

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Adam Retter

2014-09-28 12:29:26 UTC

Permalink

Post by Adam Retter
I think that question is very implementation specific. If all of your
data is in RAM, as your dataset is relatively small and these are just
numbers, I would expect performance to be excellent. I am not sure
that sorting will make much of a difference, but it depends on the
implementation and how it initiates the search for a false comparison
in a large sequence.

I retract any previous statements that may have alluded to good
performance. A quick test on Saxon and eXist shows that this is a very
slow problem. I am trying to see if there is not a short-cut that
could be taken, I will come back to you...

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Adam Retter

2014-09-28 13:13:28 UTC

Permalink

So after a bit more coffee and a bit of research it seems to me that
the only way you are going to get this to be fast would be if you used
a hash based looked for one of your sequences. Something like a
HashMap or BloomFilter would do the job, see:
http://stackoverflow.com/questions/4261619/fastest-set-operations-in-the-west

I have made some experimentations with the above in Scala, a HashMap
works nicely here in-terms of performance because it is simple and
available, and in your case with Integers you do not need to worry
about duplicates because the value of a number is it's identity.

So, if you don't want to implement a HashMap or BloomFilter in XQuery,
what is one to do? Well XQuery 3.1 will introduce the Map data-type
and some implementations already have done this. If you understand the
implementation (or even hazard a guess), there is a good chance that
that XQuery map may in-fact under the covers be a HashMap.

I won't say too much more about this as I have been discussing it with
Wolfgang this morning, and I think he will shortly post you a very
fast example when using XQuery 3.1 Maps in eXist...

Post by Adam Retter

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

w***@exist-db.org

2014-09-28 13:25:11 UTC

Permalink

Well, Michael already posted something along those lines, but heres my solution based on the public 3.1 draft:

xquery version "3.0";

let $a := 1 to 1000000
let $b := map:new((1 to 100) ! map:entry(xs:int(util:random() * 100000), 0))
return
$a[map:contains($b, .)]

Wolfgang

Ihe Onwuka

2014-09-28 15:52:19 UTC

Permalink

Post by Adam Retter
I won't say too much more about this as I have been discussing it with
Wolfgang this morning, and I think he will shortly post you a very
fast example when using XQuery 3.1 Maps in eXistâŠ
Well, Michael already posted something along those lines, but hereâs my
xquery version "3.0";
let $a := 1 to 1000000
let $b := map:new((1 to 100) ! map:entry(xs:int(util:random() * 100000), 0))
return
$a[map:contains($b, .)]
Wolfgang

Would this not be $a[not(map:contains($b, .))] since I want the ones for
which there is no matching sequence B.

Ihe Onwuka

2014-09-28 15:51:04 UTC

Permalink

Post by Adam Retter
So after a bit more coffee and a bit of research it seems to me that
the only way you are going to get this to be fast would be if you used
a hash based looked for one of your sequences. Something like a
http://stackoverflow.com/questions/4261619/fastest-set-operations-in-the-west

Hmmmmm well I have already committed myself before your retraction and the
thing is running now so am just going to leave it.

This is only the tip of the problem because each integer that survives
selection generates an HTTP request that either returns a 404 or leads to
an HTTP Put into eXist. So if I was really in a hurry I'd have to start
looking at mapReducing the keys. I'm not even sure that is the answer as
all the mapreduce jobs would be pounding the same server (mind you the
site may load balance to mitigate that).

Another alternative solution is to just export sequence B into SQLite,
index both sequences (now they are tables) and do it in SQL.

Michael Kay

2014-09-28 13:10:56 UTC

Permalink

Note that maps give you the opportunity to hand-optimize things like this if your product's optimizer doesn't find a way to do it automatically.

let $A := a set of numbers
let $B := another set of numbers
let $map := fold-left($B, map{}, map:put(?, ?, true()))
return $A[empty($map(.))]

Michael Kay
Saxonica
***@saxonica.com
+44 (0) 118 946 5893

I have sequence A consisting of the numbers 0 to 4000000 and sequnce B consisting of about 100k random I numbers within the range of sequence A and I want the outer join where sequence B is "null".
Should one expect bad performance in sequence B is not sorted and can one expect reasonable performance if it is?
I am guessing that performance will vary by implemetation.
_______________________________________________
http://x-query.com/mailman/listinfo/talk

Continue reading on narkive:

Search results for '[xquery-talk] outer join between 2 sequences' (Questions and Answers)

replies

Christians will you join Satan with me?

started 2015-07-31 21:04:02 UTC

religion & spirituality

replies

How do they get those "pictures" from outer space, of earth, from past the moon?

started 2010-07-02 08:40:12 UTC

astronomy & space

replies

ap bio question. sequence of the events of translation (listed in question)?