** Simple Airflow - Spark example
https://github.com/yansfil/grab-data-world
https://www.slideshare.net/JoenggyuLenKim/spark-152302106
https://blog.insightdatascience.com/scheduling-spark-jobs-with-airflow-4c66f3144660
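The links above cover the Airflow - Spark setup; as a rough sketch of what the scheduling side can look like, the DAG below submits a PySpark script with SparkSubmitOperator. This is a minimal sketch assuming Airflow 2.x with the apache-airflow-providers-apache-spark package; the script path, connection id, and schedule are assumptions, not taken from the repos above.

```python
# dags/spark_example_dag.py - minimal sketch, assuming Airflow 2.x with the
# apache-airflow-providers-apache-spark package installed.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_example",
    start_date=datetime(2019, 12, 1),
    schedule_interval="@daily",   # assumed schedule: run once a day
    catchup=False,
) as dag:
    # Submits example.py to the cluster configured in the "spark_default"
    # Airflow connection (e.g. spark://10.0.0.34:7077).
    submit_job = SparkSubmitOperator(
        task_id="submit_example_job",
        application="/opt/airflow/dags/example.py",  # hypothetical path
        conn_id="spark_default",
    )
```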
* Old version (docker - hadoop, spark, hive)
https://github.com/big-data-europe/docker-hadoop-spark-workbench.git
hive-server:
  image: bde2020/hive:2.1.0-postgresql-metastore
  container_name: hive-server
  env_file:
    - ./hadoop-hive.env
  environment:
    - "HIVE_CORE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore/metastore"
  ports:
    - "10000:10000"
    - "10002:10002"
* Latest version (docker - hadoop)
https://github.com/big-data-europe/docker-hadoop
* Latest version (docker - hadoop, hive)
https://github.com/big-data-europe/docker-hive
presto-coordinator:
  image: shawnzhu/prestodb:latest
  ports:
    - "8080:8080"
  volumes:
    - ./etc:/home/presto/etc
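The coordinator's HTTP interface is exposed on port 8080. A minimal sketch of running a query against it with the presto-python-client package; the user, catalog, and schema values are assumptions.

```python
# presto_query.py - minimal sketch, assuming the presto-python-client package
# is installed and the presto-coordinator container is reachable on localhost.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,          # coordinator HTTP port from the compose file above
    user="test",        # assumed user; Presto does not authenticate by default
    catalog="hive",     # assumed catalog backed by the Hive metastore
    schema="default",
)

cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```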
https://devidea.tistory.com/53
# example.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Connect to the standalone Spark master (10.0.0.34:7077)
sc = SparkContext("spark://10.0.0.34:7077", "example")

# Read a file on HDFS as an RDD of lines
lines = sc.textFile("hdfs://10.0.0.34:8020/user/root/20191213_134923.csv")

# getOrCreate() reuses the SparkContext created above
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()

# Write a small DataFrame to HDFS as CSV
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.csv("hdfs://10.0.0.34:8020/user/root/test.csv")

# Read the original CSV back into a DataFrame and display it
df_load = sparkSession.read.csv("hdfs://10.0.0.34:8020/user/root/20191213_134923.csv")
df_load.show()