Yotlan/Large_Scale_Data_Management

LARGE SCALE DATA MANAGEMENT

GROUP MEMBERS

PAGE RANK: PIG VS SPARK VS SPARK + CONTROLLED PARTITIONING

With gs://public_lddm_data/page_links_en.nt.bz2:

| PageRank implementation  | Number of nodes | Execution time (ms) |
| ------------------------ | --------------- | ------------------- |
| Pig                      | 1               | 5 965 137           |
| Pig                      | 2               | 2 926 194           |
| Pig                      | 4               | 2 097 353           |
| Pig                      | 5               | 1 936 174           |
| Spark                    | 1               | FAILED              |
| Spark                    | 2               | 2 457 349           |
| Spark                    | 4               | 1 755 194           |
| Spark                    | 5               | 1 777 790           |
| Spark + Controlled Part. | 1               | FAILED              |
| Spark + Controlled Part. | 2               | 1 507 807           |
| Spark + Controlled Part. | 4               | 1 504 586           |
| Spark + Controlled Part. | 5               | 1 469 502           |
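All three implementations iterate the same rank update; they differ in how the engine shuffles data between iterations. A pure-Python sketch of that update rule is below — the tiny graph, the damping factor 0.85, and the iteration count are illustrative assumptions, not values taken from the scripts in this repository:

```python
# Pure-Python sketch of the PageRank iteration the Pig and Spark jobs run.
# Graph, damping factor and iteration count are illustrative assumptions.
links = {
    "A": ["B", "C"],  # A links to B and C
    "B": ["C"],
    "C": ["A"],
}
ranks = {node: 1.0 for node in links}

for _ in range(50):
    contribs = {node: 0.0 for node in links}
    for node, neighbors in links.items():
        share = ranks[node] / len(neighbors)  # split rank over out-links
        for dest in neighbors:
            contribs[dest] += share
    # Classic damped update: 0.15 + 0.85 * (sum of incoming contributions)
    ranks = {node: 0.15 + 0.85 * contribs[node] for node in links}

best = max(ranks, key=ranks.get)  # "C", the most linked-to node
```

In Spark, `contribs` corresponds to a `flatMap` + `reduceByKey`, which shuffles data every iteration; "controlled partitioning" (`partitionBy` on the links RDD) keeps each node's adjacency list on a fixed partition so the join with ranks avoids reshuffling the large links dataset, which is consistent with the lower times in the table above.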

The plot for this amount of data is below:

page_links_en.nt.bz2 plot

To find out which entity has the best PageRank score, run the command below:

python3 scripts/dataproc_maxentity_pagerank.py results/page_links_en.nt.bz2.out

This command returns:

    Entity '<http://dbpedia.org/resource/Living_people>' have the best pagerank with 36794.331467544755
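A hedged sketch of what a script like scripts/dataproc_maxentity_pagerank.py plausibly does: scan the PageRank output and keep the highest-scored entity. The tab-separated `entity<TAB>score` line format is an assumption, not taken from the actual output files:

```python
# Sketch: find the entity with the best PageRank score in an output file.
# The "entity<TAB>score" line format is an assumption for illustration.
def max_entity(lines):
    best_entity, best_score = None, float("-inf")
    for line in lines:
        entity, score = line.rsplit("\t", 1)  # split off the trailing score
        if float(score) > best_score:
            best_entity, best_score = entity, float(score)
    return best_entity, best_score

sample = [
    "<http://example.org/A>\t0.42",
    "<http://example.org/B>\t1.30",
    "<http://example.org/C>\t0.07",
]
print(max_entity(sample))  # → ('<http://example.org/B>', 1.3)
```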

With gs://public_lddm_data/small_page_links.nt:

| PageRank implementation  | Number of nodes | Execution time (ms) |
| ------------------------ | --------------- | ------------------- |
| Pig                      | 1               | 181 134             |
| Pig                      | 2               | 164 404             |
| Pig                      | 4               | 170 397             |
| Pig                      | 5               | 179 346             |
| Spark                    | 1               | 16 176              |
| Spark                    | 2               | 9 570               |
| Spark                    | 4               | 9 431               |
| Spark                    | 5               | 9 503               |
| Spark + Controlled Part. | 1               | 13 535              |
| Spark + Controlled Part. | 2               | 8 173               |
| Spark + Controlled Part. | 4               | 8 591               |
| Spark + Controlled Part. | 5               | 7 810               |

The plot for this amount of data is below:

small_page_links.nt plot

To find out which entity has the best PageRank score, run the command below:

python3 scripts/dataproc_maxentity_pagerank.py results/small_page_links.nt.out

This command returns:

    Entity '<http://dbpedia.org/resource/Attention-deficit_hyperactivity_disorder>' have the best pagerank with 0.30051150556157313

PLOTTING RESULTS

To generate these plots, run the command below:

python3 scripts/dataproc_plot.py results/pagesrank.json img/
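Before anything is drawn, the measurements have to be grouped into one series per implementation — one "time vs. number of nodes" curve each. The sketch below shows that shaping step on a few values from the tables above; the record layout is an assumption (the real structure of results/pagesrank.json may differ), and failed runs are skipped so they do not break the curves:

```python
# Sketch of the data shaping a plot script needs: group (nodes, time)
# measurements into one series per implementation. The record layout is
# an assumption; the numbers come from the small_page_links.nt table.
measurements = [
    {"impl": "Pig", "nodes": 1, "ms": 181134},
    {"impl": "Pig", "nodes": 2, "ms": 164404},
    {"impl": "Spark", "nodes": 1, "ms": 16176},
    {"impl": "Spark", "nodes": 2, "ms": 9570},
    {"impl": "Spark + Controlled Part.", "nodes": 1, "ms": None},  # FAILED run
]

series = {}
for m in measurements:
    if m["ms"] is None:
        continue  # skip failed runs
    series.setdefault(m["impl"], []).append((m["nodes"], m["ms"]))

# Each entry of `series` becomes one line on the time-vs-nodes plot.
print(series["Spark"])  # → [(1, 16176), (2, 9570)]
```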

SPARQL TO PYSPARK

An example of what we want:

# SELECT * WHERE {
#    ?s isa Mountain .              # tp1
#    ?s name ?n .                   # tp2
#    FILTER (strstarts(?n, "P"))
# }

# `triples` is an RDD of (subject, predicate, object) string tuples.
tp1 = triples.filter(lambda x: x[1] == "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
                     and x[2] == "<https://schema.org/Mountain>").keyBy(lambda x: x[0])
tp2 = triples.filter(lambda x: x[1] == "<https://schema.org/name>"
                     and x[2].startswith("\"P")).keyBy(lambda x: x[0])
# Join the two triple patterns on their shared variable ?s.
tp1.join(tp2).distinct().count()

Here you take a SPARQL query and translate it into a PySpark query with the command below:

python3 scripts/sparql2pyspark.py results/query.sparql results/query.py

YOUR QUERY MUST FOLLOW THE SAME SCHEMA AS THE FILE query.sparql!
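The idea behind the translation can be sketched without Spark at all: each triple pattern becomes a filter over (s, p, o) tuples keyed by the shared variable, and patterns are combined by joining on that key. In the pure-Python sketch below, plain lists and dicts stand in for RDDs, and the toy predicates `<isa>` / `<name>` are illustrative assumptions:

```python
# Pure-Python sketch of the SPARQL-to-PySpark translation idea: one filter
# per triple pattern, then a join on the shared variable ?s. Lists/dicts
# stand in for RDDs; the toy IRIs below are illustrative assumptions.
triples = [
    ("<s1>", "<isa>", "<Mountain>"),
    ("<s1>", "<name>", "\"Puncak\""),
    ("<s2>", "<isa>", "<Mountain>"),
    ("<s2>", "<name>", "\"Everest\""),
]

def match(pred, obj=None):
    """Triple-pattern filter: return {subject: triple} for matching triples.

    obj=None plays the role of an unbound variable (matches any object)."""
    return {s: (s, p, o) for (s, p, o) in triples
            if p == pred and (obj is None or o == obj)}

tp1 = match("<isa>", "<Mountain>")                 # ?s isa Mountain
tp2 = {s: t for s, t in match("<name>").items()
       if t[2].startswith("\"P")}                  # FILTER(strstarts(?n, "P"))

# Keeping subjects present in both patterns mirrors tp1.join(tp2) in PySpark.
joined = sorted(set(tp1) & set(tp2))
print(joined)  # → ['<s1>']
```

The real scripts/sparql2pyspark.py emits PySpark code like the example above instead of evaluating the query directly, but the pattern-by-pattern decomposition is the same.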
