Sunday, January 30, 2022

How to Read & Write CSV File Data on the Local System Using PySpark


Index:
1. What is CSV
2. What is SparkSession
3. How to read a CSV file in PySpark
4. How to write a file through PySpark and store it locally

Tools:
1. PyCharm 2021.1.3, Python 3.6
2. Spark 2.4.8

#code
=========================================================================
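from pyspark.sql import SparkSession

# Initialization: local mode with 2 worker threads
spark = SparkSession.builder.master("local[2]").appName("testing").getOrCreate()

# Read the CSV file into a DataFrame, treating the first line as column names
df = spark.read.option("header", "true").csv("E://YoutubebigData//csv_read//abc.csv")
df.show()

# Write the DataFrame to the local system.
# Spark creates a directory named abc containing part files, not a single file,
# and by default fails if that directory already exists.
df.write.option("header", True).csv("E://YoutubebigData//csv_read//output//abc")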

=========================================================================
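
The first index item, what a CSV file is, can be illustrated without Spark. The sketch below uses Python's standard csv module on a small made-up sample (the column names and values are invented for illustration and are not the contents of abc.csv). It shows what treating the first line as a header means, which is roughly what option("header", "true") asks Spark to do.

```python
import csv
import io

# Hypothetical sample data: comma-separated values, one record per line.
sample = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"

# Without header handling, the first line is just another row of values.
raw_rows = list(csv.reader(io.StringIO(sample)))

# With header handling, the first line supplies the column names.
records = list(csv.DictReader(io.StringIO(sample)))

print(raw_rows[0])   # the header line, returned as ordinary data
print(records[0])    # the first data row, keyed by column name
```

Every value comes back as a string here; in Spark, the columns of a CSV read are also strings unless a schema is supplied or inferred.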

What is SparkSession?
SparkSession, introduced in Spark 2.0, is the entry point to PySpark functionality. It is used to create RDDs and DataFrames programmatically.
In the pyspark shell, a SparkSession object named spark is available by default. In an application, it is created with SparkSession.builder.

SparkSession vs SparkContext
In earlier versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) was the entry point for programming with RDDs and for connecting to a Spark cluster.
Since Spark 2.0, SparkSession is the entry point for programming with DataFrame and Dataset.
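
The relationship between the two entry points can be sketched in a few lines. This is a minimal illustration, assuming pyspark and Java are installed; the function name session_and_context_demo and the sample values are made up for this example. It reaches the SparkContext through the session, uses it for an RDD, and uses the session itself for a DataFrame.

```python
def session_and_context_demo():
    """Use one SparkSession for both the RDD API and the DataFrame API."""
    # Imported inside the function so the file still loads where Spark is absent.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("testing").getOrCreate()
    try:
        # The SparkContext that backs this session
        sc = spark.sparkContext

        # RDD API, through the SparkContext
        rdd_total = sc.parallelize([1, 2, 3, 4]).sum()

        # DataFrame API, through the SparkSession
        df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
        row_count = df.count()

        return rdd_total, row_count
    finally:
        spark.stop()
```

Calling session_and_context_demo() returns the RDD sum and the DataFrame row count, so a single session covers both styles of programming.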
----------------------------------------------------------------------------------------------------------------------------
Video Link: https://www.youtube.com/watch?v=zRUWkzGosqA





Friday, January 28, 2022

Amazon S3 Basic Intro

AWS S3:
- Use Amazon S3 to store and retrieve any amount of data using highly scalable, reliable, fast, and inexpensive data storage.

- A bucket is a container for objects.

- An object is a file and any metadata that describes that file.

Index:
1. Create bucket
2. Load data file
3. Download data file
4. Delete data file
5. Delete bucket
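
The post does not say which tool is used for these steps. As one possible sketch, the same five steps can be scripted with the boto3 library; using boto3 is an assumption here, and the function name s3_walkthrough, the bucket name, the key, and the file paths are all hypothetical.

```python
def s3_walkthrough(s3, bucket, key, upload_path, download_path):
    """Run the five index steps in order with a boto3 S3 client passed in as `s3`."""
    # 1. Create bucket. Outside us-east-1, boto3 also needs a
    #    CreateBucketConfiguration with a LocationConstraint.
    s3.create_bucket(Bucket=bucket)
    # 2. Load (upload) a local data file as an object
    s3.upload_file(upload_path, bucket, key)
    # 3. Download the object back to a local file
    s3.download_file(bucket, key, download_path)
    # 4. Delete the data file (object)
    s3.delete_object(Bucket=bucket, Key=key)
    # 5. Delete the bucket, which must be empty at this point
    s3.delete_bucket(Bucket=bucket)


# Example call, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   s3_walkthrough(boto3.client("s3"), "my-example-bucket", "abc.csv",
#                  "abc.csv", "abc_downloaded.csv")
```

Passing the client in as an argument keeps the function free of credentials and makes the order of the steps easy to check.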


Video Link: https://www.youtube.com/watch?v=f9Bxe-3tTNU

What is Athena? How Does It Work?

Index:
1. Set up Athena
2. Create and select a database
3. Create a table
4. Load data into the table
5. Overview of running a query

Prerequisites:

Know about Hive and S3


Athena is similar to Hive, but it is serverless:
- Athena builds on Hive for table definitions (Hive-style DDL)
- No server is required
- As in Hive, the input data is expected to have no header row
- There is no charge for DDL (Data Definition Language) commands
- You are charged only for the data scanned, with a minimum of 10 MB per query
- Athena works on external tables
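
The charging rule can be worked through with a little arithmetic. The sketch below applies only the 10 MB minimum described above. The price per TB is passed in as a parameter because it depends on region and date (the 5.0 in the example is an assumption, so check the current AWS pricing page), and treating 1 MB as 1024 * 1024 bytes is also an assumption.

```python
MB = 1024 ** 2                 # assumption: 1 MB = 1024 * 1024 bytes
TB = 1024 ** 4
MIN_BILLED_BYTES = 10 * MB     # 10 MB minimum per query


def billed_bytes(scanned_bytes, is_ddl=False):
    """Bytes charged for one query: nothing for DDL, else at least 10 MB."""
    if is_ddl:
        return 0
    return max(scanned_bytes, MIN_BILLED_BYTES)


def query_cost(scanned_bytes, price_per_tb, is_ddl=False):
    """Cost of one query, given a price per TB scanned."""
    return billed_bytes(scanned_bytes, is_ddl) / TB * price_per_tb


# A query that scans 3 MB is still billed as 10 MB.
print(billed_bytes(3 * MB) // MB)
# Cost of scanning 1 TB at an assumed price of 5.0 per TB.
print(query_cost(TB, 5.0))
```

Because the charge follows the bytes scanned, reducing what a query reads (for example, by selecting fewer columns from a columnar format) lowers the bill.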

Note:

'skip.header.line.count'='1'   # table property that skips the header line
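
To show where this property goes, here is a sketch that builds a CREATE EXTERNAL TABLE statement as a Python string. The helper name build_create_table_ddl and the database, table, column, and bucket names are hypothetical; the resulting statement would be pasted into the Athena query editor.

```python
def build_create_table_ddl(database, table, columns, s3_location):
    """Return Athena DDL for a comma-delimited external table with one header line."""
    column_lines = ",\n".join(
        "  `{}` {}".format(name, col_type) for name, col_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {}.{} (\n".format(database, table)
        + column_lines + "\n"
        + ")\n"
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        + "LOCATION '{}'\n".format(s3_location)
        # Tell Athena to skip the first line of each file (the header)
        + "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical names, for illustration only
ddl = build_create_table_ddl(
    "demo_db",
    "people",
    [("id", "int"), ("name", "string")],
    "s3://my-example-bucket/people/",
)
print(ddl)
```

The LOCATION points at an S3 folder rather than a single file, and the table stays external: dropping it removes the metadata but not the data in S3.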

Video Link : https://www.youtube.com/watch?v=TzlN-aXKc_w





2 Basic Python Programs

1. Read and display user inputs in a Python program 2. Sum and average of float numbers using a Python program ...