This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

How do I actually change the scale factor? #28

Open
v-olmedo opened this issue May 17, 2018 · 2 comments

Comments

@v-olmedo

I do not see any way to do that.

@dilipbiswal
Contributor

Hello @v-olmedo,
Thanks for trying out the code pattern. This pattern was initially targeted at developers, and the target platform was a laptop. My thought was that data generated at a larger scale factor might be too large for a laptop running Spark, which is why I didn't expose the scale factor. Here is the line in the code that currently hard-codes it to 1G:

 "2")  gen_data $TPCDS_ROOT_DIR '1G' ;;

You can change that value to increase the scale factor. If you want parallelism in processing, please make sure to move the data to HDFS. You may also want to partition the data. I have touched on this very briefly in the doc.
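
For anyone who would rather not edit the script each time, here is a minimal sketch of how that menu branch could read the scale factor from an environment variable. This is not the script's actual code. `TPCDS_SCALE` and `handle_choice` are hypothetical names, and `gen_data` is replaced by a stand-in that only echoes its arguments. The `1G` default mirrors the value quoted above. Whether the real `gen_data` accepts other values in the same format is an assumption you should check against the script.

```shell
#!/usr/bin/env bash
# Sketch only: TPCDS_SCALE and handle_choice are hypothetical names,
# not part of the original script.

TPCDS_ROOT_DIR="${TPCDS_ROOT_DIR:-$PWD}"
# Fall back to the value the script currently hard-codes.
TPCDS_SCALE="${TPCDS_SCALE:-1G}"

# Stand-in for the script's real gen_data function; it only reports
# the arguments it would be called with.
gen_data() {
  echo "gen_data root=$1 scale=$2"
}

# Same shape as the quoted case branch, with the literal '1G'
# replaced by the configurable variable.
handle_choice() {
  case "$1" in
    "2") gen_data "$TPCDS_ROOT_DIR" "$TPCDS_SCALE" ;;
    *)   echo "unhandled option: $1" ;;
  esac
}
```

With a change along these lines, setting `TPCDS_SCALE` before launching the script would pick a larger scale factor, and leaving it unset keeps the laptop-friendly default.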

@HichamISIMA

Hello @dilipbiswal,
You stated: "Please make sure to move the data to HDFS". Does that mean that dsdgen can't generate the tables in a parallel, distributed manner across a cluster that doesn't use HDFS? Also, for query execution with dsqgen, I don't seem to get any distributed processing!
