Ihe Onwuka
2015-07-01 22:02:11 UTC
Hmmm I wonder whether this would have worked on the scraped ratings data
that I had to clean. Well I did that with XPath in XSLT, might take a look
and see.
I have 3 different movie data sets from different sources. The one I just
posted contains 3m movies, another has 180k movies (created with JSONiq
running against Freebase) and the third about 50k movies, all of which I
have managed to cast into XML.
So there is plenty of data to experiment with.
I look forward to your trip, I'll be around for a few months myself.
Ihe,
transforming XQuery to be able to do data cleaning has been a LONG desire
of mine.
Helena Galhardas was a PhD student of mine. She is now a professor in
Lisbon.
She and her students wrote the data cleaning package in Zorba; it's 100%
clean XQuery,
so you can reuse it for other engines.
Let me know how it goes.
On the 7th I am leaving for Europe for 3-4 months.
I will certainly visit London often.
Hope we can talk, best
Dana
Ihe,
before you load anything anywhere, you need to do data cleaning on this
data
if you do integration from the Web and the data has no unique ids...
In particular, entity resolution...
The literature is full of data cleaning and entity resolution algorithms.
Here is one that you will find familiar (because it looks very much like XQuery):
http://www.inesc-id.pt/ficheiros/publicacoes/1259.pdf
Best regards
Dana
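
As a rough illustration of the entity resolution step mentioned above, here is a minimal Python sketch that normalizes titles and scores how alike two of them are. It is not the Zorba package or the algorithm from the linked paper; the function names `normalize_title` and `title_similarity` are hypothetical, and the similarity measure is simply the standard library's `difflib.SequenceMatcher`.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything that is not a-z or 0-9 becomes a single space.
    alnum_only = re.sub(r"[^a-z0-9]+", " ", no_accents.lower())
    return alnum_only.strip()


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
```

A score near 1.0 suggests the two records may describe the same movie, but since titles are not unique a real matcher would also need to compare year, director and so on.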
You will note that the data doesn't have a unique id. Title certainly
isn't unique, if you consider how many movies there have been called Batman
or Treasure Island.
Now I may encounter data about this movie from another source that covers
different facets, for example its box office takings or movie reviews.
So it's a classic semantic web application. I want to amalgamate disparate
data about the same fact in one entity. As I said I have a transformation
that does this but it doesn't scale very well because I have to search the
entire movie base to find the best match. To overcome this I have to adopt
a mapReduce-ish approach to solve the problem.
The thinking is a graphical representation would eliminate that problem
because a graph gives me a persistent data structure already indexed for
retrieval via several different axes, whereas indexes constructed in the
XSLT transformation for the same purpose are ephemeral and would need to
be reconstructed every time you ran the transformation.
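
The scaling problem described above (searching the entire movie base for every match) is commonly reduced by indexing records under a cheap key and comparing only within a key. The Python sketch below shows that idea under stated assumptions: the record layout (`title`, `aliases`) and the names `blocking_key`, `build_index` and `candidates` are hypothetical, and the index lives in memory, so it is just as ephemeral as an XSLT key. A graph database would play the role of the persistent version.

```python
from collections import defaultdict


def blocking_key(title: str) -> str:
    """Hypothetical blocking key: lower-cased alphanumeric characters only."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def build_index(movies):
    """Map each blocking key to the ids of movies carrying that title or alias."""
    index = defaultdict(set)
    for movie_id, record in movies.items():
        for title in [record["title"], *record.get("aliases", [])]:
            index[blocking_key(title)].add(movie_id)
    return index


def candidates(index, title):
    """Return only the movies sharing a blocking key, not the whole movie base."""
    return index.get(blocking_key(title), set())


# Made-up sample records, for illustration only.
movies = {
    "m1": {
        "title": "20000 lieues sous les mers",
        "aliases": ["20,000 Leagues Under the Sea ", "Under the Seas "],
    },
    "m2": {"title": "Batman"},
}
index = build_index(movies)
```

With the index built once, `candidates(index, "Under the Seas")` looks at a single bucket rather than every movie; the more expensive best-match scoring then runs only on that bucket.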
On Wed, Jul 1, 2015 at 12:46 PM, Peter Hunsberger <
Should be pretty straightforward to import that into Neo4J or Titan.
Neo might be simplest, in particular via conversion of the data into JSON.
However, Titan might give you other capabilities such as using Hadoop type
processing either for import or for subsequent analytics. Without knowing
more about the business requirements I can't really give you much more than
that...
Peter Hunsberger
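
For the JSON conversion suggested above, a minimal Python sketch using only the standard library might look like this. The JSON shape is an assumption chosen for illustration, not a format required by Neo4j or Titan, and `movie_to_json` is a hypothetical helper. `MOVIE_XML` is an abbreviated copy of the snippet from the original post.

```python
import json
import xml.etree.ElementTree as ET

# Abbreviated copy of the movie snippet from the original post.
MOVIE_XML = """
<movie title="20000 lieues sous les mers">
  <actors><person name="Méliès, Georges"/></actors>
  <directors><person name="Méliès, Georges"/></directors>
  <genres><tag name="adventure"/><tag name="sci-fi"/></genres>
</movie>
"""


def movie_to_json(xml_text: str) -> str:
    """Turn one <movie> element into a JSON document, one list per child group."""
    movie = ET.fromstring(xml_text)
    doc = {"title": movie.get("title")}
    for group in movie:  # actors, directors, genres, ...
        # Keep every attribute of each child so nothing is lost in conversion.
        doc[group.tag] = [dict(child.attrib) for child in group]
    return json.dumps(doc, ensure_ascii=False, indent=2)
```

Calling `movie_to_json(MOVIE_XML)` gives one JSON object per movie; whichever import tool is chosen would then decide how those lists become nodes and relationships.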
I would like to convert the XML snippet below to a multi-relational
graph representation.
One way is to transform to a triple store via RDF. Another, which I am less
familiar with, is to transform to GraphML followed by a subsequent import
into some graph database tool.
The graphical representation is desirable for processing rather than
visualization reasons. Chiefly, I have a matching algorithm implemented in
XSLT which works fine but doesn't scale well, a problem that I think can be
solved with a graphical representation.
I am keen to hear from my elders and betters on the subject.
<movie title="20000 lieues sous les mers">
<actors>
<person name="Méliès, Georges"/>
</actors>
<alias>
<title title="20,000 Leagues Under the Sea " year="1907"/>
<title title="Amid the Workings of the Deep " year="1907"/>
<title title="Deux cent mille lieues sous les mers " year="1907"/>
<title title="Le cauchemar d'un pêcheur " year="1907"/>
<title title="Under the Seas " year="1907"/>
</alias>
<directors>
<person name="Méliès, Georges"/>
</directors>
<genres>
<tag name="adventure"/>
<tag name="fantasy"/>
<tag name="sci-fi"/>
<tag name="short"/>
</genres>
<keywords>
<tag name="based-on-novel"/>
<tag name="dream"/>
<tag name="fish"/>
<tag name="number-in-title"/>
<tag name="submarine"/>
<tag name="undersea-monster"/>
<tag name="underwater"/>
</keywords>
<producers>
<person name="Méliès, Georges"/>
</producers>
</movie>
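
To make the multi-relational idea concrete, here is a minimal Python sketch that reads a movie element like the one above and emits (subject, relation, object) edges, which is the shape both an RDF triple store and a property graph start from. The edge labels are simply the container element names (`actors`, `alias`, `directors`, ...); that mapping and the helper `movie_to_triples` are assumptions made for illustration, not an established vocabulary. `MOVIE_XML` is an abbreviated copy of the snippet.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the movie snippet above.
MOVIE_XML = """
<movie title="20000 lieues sous les mers">
  <actors><person name="Méliès, Georges"/></actors>
  <alias><title title="Under the Seas " year="1907"/></alias>
  <directors><person name="Méliès, Georges"/></directors>
  <genres><tag name="sci-fi"/><tag name="short"/></genres>
  <producers><person name="Méliès, Georges"/></producers>
</movie>
"""


def movie_to_triples(xml_text):
    """Return (subject, relation, object) edges; nodes are (kind, label) pairs."""
    movie = ET.fromstring(xml_text)
    subject = ("movie", movie.get("title"))
    triples = []
    for group in movie:  # each container element names one relation type
        for child in group:
            # person/tag carry @name; alias titles carry @title (trimmed here)
            label = child.get("name") or child.get("title", "").strip()
            triples.append((subject, group.tag, (child.tag, label)))
    return triples


triples = movie_to_triples(MOVIE_XML)
```

Because nodes are identified by kind and label, the one person node for Méliès ends up linked to the movie by three differently labelled edges (`actors`, `directors`, `producers`), which is what makes the graph multi-relational.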