Yotlan/Large_Scale_Data_Management

LARGE SCALE DATA MANAGEMENT

GROUP MEMBERS

PAGE RANK: PIG VS SPARK VS SPARK + CONTROLLED PARTITIONING

With gs://public_lddm_data/page_links_en.nt.bz2:

| PageRank implementation  | Number of nodes | Execution time (ms) |
| ------------------------ | --------------- | ------------------- |
| Pig                      | 1               | 5 965 137           |
| Pig                      | 2               | 2 926 194           |
| Pig                      | 4               | 2 097 353           |
| Pig                      | 5               | 1 936 174           |
| Spark                    | 1               | FAILED              |
| Spark                    | 2               | 2 457 349           |
| Spark                    | 4               | 1 755 194           |
| Spark                    | 5               | 1 777 790           |
| Spark + Controlled Part. | 1               | FAILED              |
| Spark + Controlled Part. | 2               | 1 507 807           |
| Spark + Controlled Part. | 4               | 1 504 586           |
| Spark + Controlled Part. | 5               | 1 469 502           |
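All three implementations iterate the same rank update; they differ in how the engine shuffles data between iterations. A pure-Python sketch of that update rule is below — the tiny graph, the damping factor 0.85, and the iteration count are illustrative assumptions, not values taken from the scripts in this repository:

```python
# Pure-Python sketch of the PageRank iteration the Pig and Spark jobs run.
# Graph, damping factor and iteration count are illustrative assumptions.
links = {
    "A": ["B", "C"],  # A links to B and C
    "B": ["C"],
    "C": ["A"],
}
ranks = {node: 1.0 for node in links}

for _ in range(50):
    contribs = {node: 0.0 for node in links}
    for node, neighbors in links.items():
        share = ranks[node] / len(neighbors)  # split rank over out-links
        for dest in neighbors:
            contribs[dest] += share
    # Classic damped update: 0.15 + 0.85 * (sum of incoming contributions)
    ranks = {node: 0.15 + 0.85 * contribs[node] for node in links}

best = max(ranks, key=ranks.get)  # "C", the most linked-to node
```

In Spark, `contribs` corresponds to a `flatMap` + `reduceByKey`, which shuffles data every iteration; "controlled partitioning" (`partitionBy` on the links RDD) keeps each node's adjacency list on a fixed partition so the join with ranks avoids reshuffling the large links dataset, which is consistent with the lower times in the table above.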

The plot for this amount of data is below:

page_links_en.nt.bz2 plot

To find out which entity has the best PageRank score, run the command below:

python3 scripts/dataproc_maxentity_pagerank.py results/page_links_en.nt.bz2.out

This command returns:

    Entity '<http://dbpedia.org/resource/Living_people>' have the best pagerank with 36794.331467544755
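A hedged sketch of what a script like scripts/dataproc_maxentity_pagerank.py plausibly does: scan the PageRank output and keep the highest-scored entity. The tab-separated `entity<TAB>score` line format is an assumption, not taken from the actual output files:

```python
# Sketch: find the entity with the best PageRank score in an output file.
# The "entity<TAB>score" line format is an assumption for illustration.
def max_entity(lines):
    best_entity, best_score = None, float("-inf")
    for line in lines:
        entity, score = line.rsplit("\t", 1)  # split off the trailing score
        if float(score) > best_score:
            best_entity, best_score = entity, float(score)
    return best_entity, best_score

sample = [
    "<http://example.org/A>\t0.42",
    "<http://example.org/B>\t1.30",
    "<http://example.org/C>\t0.07",
]
print(max_entity(sample))  # → ('<http://example.org/B>', 1.3)
```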

With gs://public_lddm_data/small_page_links.nt:

| PageRank implementation  | Number of nodes | Execution time (ms) |
| ------------------------ | --------------- | ------------------- |
| Pig                      | 1               | 181 134             |
| Pig                      | 2               | 164 404             |
| Pig                      | 4               | 170 397             |
| Pig                      | 5               | 179 346             |
| Spark                    | 1               | 16 176              |
| Spark                    | 2               | 9 570               |
| Spark                    | 4               | 9 431               |
| Spark                    | 5               | 9 503               |
| Spark + Controlled Part. | 1               | 13 535              |
| Spark + Controlled Part. | 2               | 8 173               |
| Spark + Controlled Part. | 4               | 8 591               |
| Spark + Controlled Part. | 5               | 7 810               |

The plot for this amount of data is below:

small_page_links.nt plot

To find out which entity has the best PageRank score, run the command below:

python3 scripts/dataproc_maxentity_pagerank.py results/small_page_links.nt.out

This command returns:

    Entity '<http://dbpedia.org/resource/Attention-deficit_hyperactivity_disorder>' have the best pagerank with 0.30051150556157313

PLOTTING RESULTS

To generate these plots, run the command below:

python3 scripts/dataproc_plot.py results/pagesrank.json img/
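Before anything is drawn, the measurements have to be grouped into one series per implementation — one "time vs. number of nodes" curve each. The sketch below shows that shaping step on a few values from the tables above; the record layout is an assumption (the real structure of results/pagesrank.json may differ), and failed runs are skipped so they do not break the curves:

```python
# Sketch of the data shaping a plot script needs: group (nodes, time)
# measurements into one series per implementation. The record layout is
# an assumption; the numbers come from the small_page_links.nt table.
measurements = [
    {"impl": "Pig", "nodes": 1, "ms": 181134},
    {"impl": "Pig", "nodes": 2, "ms": 164404},
    {"impl": "Spark", "nodes": 1, "ms": 16176},
    {"impl": "Spark", "nodes": 2, "ms": 9570},
    {"impl": "Spark + Controlled Part.", "nodes": 1, "ms": None},  # FAILED run
]

series = {}
for m in measurements:
    if m["ms"] is None:
        continue  # skip failed runs
    series.setdefault(m["impl"], []).append((m["nodes"], m["ms"]))

# Each entry of `series` becomes one line on the time-vs-nodes plot.
print(series["Spark"])  # → [(1, 16176), (2, 9570)]
```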

SPARQL TO PYSPARK

An example of what we want:

# SELECT * WHERE {
#    ?s isa Mountain .              # tp1
#    ?s name ?n .                   # tp2
#    FILTER (strstarts(?n, "P"))
# }

# `triples` is an RDD of (subject, predicate, object) string tuples.
tp1 = triples.filter(lambda x: x[1] == "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
                     and x[2] == "<https://schema.org/Mountain>").keyBy(lambda x: x[0])
tp2 = triples.filter(lambda x: x[1] == "<https://schema.org/name>"
                     and x[2].startswith("\"P")).keyBy(lambda x: x[0])
# Join the two triple patterns on their shared variable ?s.
tp1.join(tp2).distinct().count()

Here you take a SPARQL query and translate it into a PySpark query with the command below:

python3 scripts/sparql2pyspark.py results/query.sparql results/query.py

YOUR QUERY MUST FOLLOW THE SAME SCHEMA AS THE FILE query.sparql!
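The idea behind the translation can be sketched without Spark at all: each triple pattern becomes a filter over (s, p, o) tuples keyed by the shared variable, and patterns are combined by joining on that key. In the pure-Python sketch below, plain lists and dicts stand in for RDDs, and the toy predicates `<isa>` / `<name>` are illustrative assumptions:

```python
# Pure-Python sketch of the SPARQL-to-PySpark translation idea: one filter
# per triple pattern, then a join on the shared variable ?s. Lists/dicts
# stand in for RDDs; the toy IRIs below are illustrative assumptions.
triples = [
    ("<s1>", "<isa>", "<Mountain>"),
    ("<s1>", "<name>", "\"Puncak\""),
    ("<s2>", "<isa>", "<Mountain>"),
    ("<s2>", "<name>", "\"Everest\""),
]

def match(pred, obj=None):
    """Triple-pattern filter: return {subject: triple} for matching triples.

    obj=None plays the role of an unbound variable (matches any object)."""
    return {s: (s, p, o) for (s, p, o) in triples
            if p == pred and (obj is None or o == obj)}

tp1 = match("<isa>", "<Mountain>")                 # ?s isa Mountain
tp2 = {s: t for s, t in match("<name>").items()
       if t[2].startswith("\"P")}                  # FILTER(strstarts(?n, "P"))

# Keeping subjects present in both patterns mirrors tp1.join(tp2) in PySpark.
joined = sorted(set(tp1) & set(tp2))
print(joined)  # → ['<s1>']
```

The real scripts/sparql2pyspark.py emits PySpark code like the example above instead of evaluating the query directly, but the pattern-by-pattern decomposition is the same.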
