Data_CoderKing: January 2022

Sunday, January 30, 2022

How to Read & Write CSV File Data on the Local System Using PySpark

Index:

1. What is CSV

2. What is SparkSession

3. How to read a CSV file in PySpark

4. How to write a file with PySpark and store it locally

Tools:

1. PyCharm 2021.1.3, Python 3.6

2. Spark 2.4.8

#code

=========================================================================

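from pyspark.sql import SparkSession

# Initialization: run Spark locally with 2 worker threads
spark = SparkSession.builder.master("local[2]").appName("testing").getOrCreate()

# Read the CSV file into a DataFrame; header=true uses the first line as column names
df = spark.read.option("header", "true").csv("E://YoutubebigData//csv_read//abc.csv")

df.show()

# Write the DataFrame to the local system.
# Spark creates a directory named "abc" that holds the part files, and the
# default save mode raises an error if that directory already exists.
df.write.option("header", True).csv("E://YoutubebigData//csv_read//output//abc")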

=========================================================================
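Note that the write step above produces a directory named abc, not a single file: Spark writes one part-*.csv file per partition plus a _SUCCESS marker. Below is a minimal sketch of locating those part files with the Python standard library. The helper name list_part_files is hypothetical, and the directory passed in is whatever path was given to df.write.

```python
import glob
import os


def list_part_files(output_dir):
    """Return the CSV part files found in a Spark output directory, sorted by name."""
    # Spark names its data files part-00000-<id>.csv, part-00001-<id>.csv, ...
    # Marker files such as _SUCCESS and hidden .crc files do not match this pattern.
    pattern = os.path.join(output_dir, "part-*.csv")
    return sorted(glob.glob(pattern))
```

For example, list_part_files("E://YoutubebigData//csv_read//output//abc") would return the data files written by the code above. If a single output file is needed, calling df.coalesce(1) before write is a common approach, at the cost of writing everything from one task.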

What is SparkSession?

SparkSession was introduced in Spark 2.0. It is the entry point to the underlying PySpark functionality and is used to programmatically create PySpark RDDs and DataFrames.

A SparkSession object named spark is available by default in the pyspark shell; in a standalone program, create one with SparkSession.builder.

SparkSession vs SparkContext:

In earlier versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) was the entry point for programming with RDDs and for connecting to a Spark cluster.

Since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets. The underlying SparkContext is still available through the session's sparkContext attribute.
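To make the relationship concrete, here is a minimal sketch that builds one session and uses both entry points through it. The function name demo_entry_points and the sample values are hypothetical; the import sits inside the function only so that the snippet can be loaded on a machine without Spark installed.

```python
def demo_entry_points():
    """Create a SparkSession and use both entry points; returns (rdd_sum, row_count)."""
    # Imported here so that defining the function does not require PySpark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("testing").getOrCreate()
    try:
        # RDD API: reached through the SparkContext that the session wraps
        sc = spark.sparkContext
        rdd_sum = sc.parallelize([1, 2, 3]).sum()

        # DataFrame API: created directly from the SparkSession
        df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
        row_count = df.count()
    finally:
        # Release the local Spark resources even if one of the steps fails
        spark.stop()
    return rdd_sum, row_count
```

Calling demo_entry_points() on a machine with Spark and Java installed runs the RDD and DataFrame steps against the same local session.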

----------------------------------------------------------------------------------------------------------------------------

Video Link: https://www.youtube.com/watch?v=zRUWkzGosqA

Friday, January 28, 2022

Amazon S3 Basic Intro

AWS S3:

- Use Amazon S3 to store and retrieve any amount of data using highly scalable, reliable, fast, and inexpensive data storage

- A bucket is a container for objects.

- An object is a file and any metadata that describes that file.

Index:

1. Create bucket

2. Load data file

3. Download data file

4. Delete data file

5. Delete bucket
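One way to script the five steps above is with boto3, the AWS SDK for Python. This is a sketch under assumptions, since the post does not say which tool the video uses: the function name s3_walkthrough and all bucket, key, and path values are hypothetical, and the s3 argument is expected to be a client created with boto3.client("s3").

```python
def s3_walkthrough(s3, bucket, local_path, key, download_path, region=None):
    """Run the create / load / download / delete lifecycle on one bucket."""
    # 1. Create bucket (outside us-east-1 the region must be passed explicitly)
    if region and region != "us-east-1":
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    else:
        s3.create_bucket(Bucket=bucket)

    # 2. Load (upload) a local data file as an object
    s3.upload_file(local_path, bucket, key)

    # 3. Download the object to a local file
    s3.download_file(bucket, key, download_path)

    # 4. Delete the object; a bucket must be empty before it can be deleted
    s3.delete_object(Bucket=bucket, Key=key)

    # 5. Delete the bucket
    s3.delete_bucket(Bucket=bucket)
```

With valid AWS credentials configured, a call might look like s3_walkthrough(boto3.client("s3"), "my-example-bucket", "abc.csv", "csv_read/abc.csv", "abc_copy.csv"). Passing the client in as an argument keeps the function easy to try against a stub before touching a real account.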

Video Link: https://www.youtube.com/watch?v=f9Bxe-3tTNU

What is Athena? How Does It Work?

Index:

1. Set up Athena

2. Create & select database

3. Create table

4. Load data into table

5. Overview of running queries

Prerequisites:

Familiarity with Hive and S3

- Athena is similar to Hive, but it is serverless

- Athena uses Hive-compatible DDL and table metadata; queries run on a Presto-based engine

- No server setup is required

- As in Hive, the input data files are expected to have no header row

- In Athena:

- DDL (Data Definition Language) commands are not charged

- Queries are charged only for the data they scan, with a 10 MB minimum per query

- Athena works with external tables
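The billing rule in the list above can be sketched as a small calculation. The 10 MB minimum comes from the text; the price per terabyte is an assumption (5 USD is used here purely as a placeholder, so check current AWS pricing for your region), as are the use of binary units and the function name estimate_query_cost.

```python
MB = 1024 ** 2
TB = 1024 ** 4
MIN_BILLED_BYTES = 10 * MB  # minimum charge per query, as stated in the list above


def estimate_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate the cost in USD of one data-scanning Athena query."""
    # usd_per_tb is an assumed placeholder price, not a figure from the post
    billed_bytes = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed_bytes / TB * usd_per_tb
```

Under these assumptions a query that scans 1 KB is billed the same as one that scans 10 MB, which is why many tiny queries can add up faster than expected.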

Note:

'skip.header.line.count'='1' # table property that skips the first (header) line of each file
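As a sketch of where this property goes, the helper below builds a CREATE EXTERNAL TABLE statement for comma-separated files that have a header row. The database, table, column names, and S3 location are all hypothetical placeholders, and the helper build_csv_table_ddl is illustrative rather than part of any library.

```python
def build_csv_table_ddl(database, table, columns, s3_location):
    """Build Athena DDL for an external table over headered CSV files in S3."""
    # columns is a list of (name, hive_type) pairs, e.g. [("id", "int")]
    column_sql = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical names and location, for illustration only
ddl = build_csv_table_ddl(
    "demo_db", "abc", [("id", "int"), ("name", "string")],
    "s3://example-bucket/csv_read/",
)
print(ddl)
```

The resulting string can be pasted into the Athena query editor. The LOCATION should point at the S3 folder that holds the files, not at a single file.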

Video Link : https://www.youtube.com/watch?v=TzlN-aXKc_w
