Saturday, June 11, 2022

2 Basic Python Programs

 

2 Basic Python Programs:


1. Read and display user inputs in Python

2. Sum and average of float numbers in Python


1. Read and display user inputs

# input() returns a string, so numeric fields are converted explicitly
emp_id = int(input("Enter Id: "))   # renamed from 'id', which shadows a Python built-in
name = input("Enter Name: ")
sal = float(input("Enter Salary: "))

print("Your Id = {}\nName = {}\nSalary = {}".format(emp_id, name, sal))

---------------------------------------------------------------------

2. Sum and average of float numbers

a, b, c = [float(i) for i in input("Enter 3 numbers: ").split()]

total = a + b + c   # renamed from 'sum', which shadows a Python built-in
avg = total / 3
mul = a * b * c

# %.2f is used throughout: %i would silently truncate the float results
print("Sum = %.2f\nAverage = %.2f\nProduct = %.2f" % (total, avg, mul))
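The same computation can be wrapped in a small function so it is easy to reuse and test (a minimal sketch; the function name is illustrative):

```python
def sum_avg_mul(a, b, c):
    """Return the sum, average, and product of three numbers."""
    total = a + b + c
    return total, total / 3, a * b * c

print(sum_avg_mul(1.0, 2.0, 3.0))  # → (6.0, 2.0, 6.0)
```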

--------------------------------------------



Even & odd check program with and without a UDF

 





1. Without UDF

A = int(input("Enter Number: "))

if A == 0:
    print("Given number is zero")
elif A % 2 == 0:
    print("Given number is even")
else:
    print("Given number is odd")

2. With UDF

def EvenOdd(x):
    # use the parameter x, not the global A
    if x == 0:
        print("Given number is zero")
    elif x % 2 == 0:
        print("Given number is even")
    else:
        print("Given number is odd")

# int, not float: even/odd applies to whole numbers
A = int(input("Enter a number as input value: "))

# Printing result
EvenOdd(A)
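A variant that returns the classification instead of printing it makes the logic easy to test (a minimal sketch; the function name is illustrative):

```python
def classify(x):
    """Classify an integer as 'zero', 'even', or 'odd'."""
    if x == 0:
        return "zero"
    return "even" if x % 2 == 0 else "odd"

print(classify(0), classify(4), classify(7))  # → zero even odd
```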

Thursday, February 10, 2022

Difference between Partition and bucketing in Hive

 



Difference: 

1. A partition is a folder.

A bucket is a file.

2. Go with partitioning when the column has few distinct values.

Go with bucketing when the column has many distinct values.

3. Partitions are a logical division.

Buckets are based on a hash (here we should go with a fixed number of buckets).

4. Partition syntax: 

create table table_name (col1 datatype, col2 datatype, col3 datatype)

partitioned by (col4 datatype, col5 datatype);

Bucketing syntax: 

create table table_name (col1 datatype, col2 datatype, col3 datatype)

partitioned by (col4 datatype, col5 datatype)

clustered by (col2) into 50 buckets;
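The hash-based bucket assignment from point 3 can be illustrated in plain Python (a rough sketch; Hive uses its own hash function, so the actual bucket numbers will differ):

```python
def bucket_for(value, num_buckets=50):
    """Assign a value to one of a fixed number of buckets by hashing."""
    return hash(value) % num_buckets

# Every row with the same column value always lands in the same bucket,
# which is what makes bucketed joins and sampling efficient
assert bucket_for("user_123") == bucket_for("user_123")
```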


Video: https://www.youtube.com/watch?v=WKpELsHZ0Zc&list=PLVt87wOZJLOdtvKLe6X846CbuFNKOX95E&index=2






Difference between Managed table and External table

 



Difference 

1. In a managed table, both the data and the schema are under Hive's control.

In an external table, only the schema is under Hive's control.


2. Managed table syntax : 

create table table_name (col1 datatype, col2 datatype, ..., coln datatype)

row format delimited 

fields terminated by ' '

stored as textfile;

External table syntax:

create external table table_name (col1 datatype, col2 datatype, ..., coln datatype)

row format delimited 

fields terminated by ' '

stored as textfile;


3. When a managed table is dropped, both data and metadata are removed, which means the underlying HDFS data is deleted along with the metadata.

When an external table is dropped, only the metadata is removed, which means the underlying HDFS directory remains intact.


Videos: https://www.youtube.com/watch?v=XbuWg-JRYT0&list=PLVt87wOZJLOdtvKLe6X846CbuFNKOX95E



Sunday, January 30, 2022

How To Read & Write CSV File Data In Local System By Using Pyspark


Index:
1. What is CSV
2. What is SparkSession
3. How to read a CSV file in PySpark
4. How to write a file through PySpark and store it locally

Tools:
1. PyCharm 2021.1.3, Python 3.6
2. Spark 2.4.8

#code
=========================================================================
from pyspark.sql import SparkSession

# Initialization
spark = SparkSession.builder.master("local[2]").appName("testing").getOrCreate()

# Read the data and create a DataFrame
df = spark.read.option("header", "true").csv("E://YoutubebigData//csv_read//abc.csv")
df.show()

# Write the data to the local file system
df.write.option("header", True).csv("E://YoutubebigData//csv_read//output//abc")

=========================================================================
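As a quick sanity check without a Spark installation, the same read/write round trip can be sketched with Python's built-in csv module (the file name and contents here are illustrative):

```python
import csv
import os
import tempfile

# Write a small CSV with a header row
path = os.path.join(tempfile.mkdtemp(), "abc.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerow(["1", "sachin"])

# Read it back, treating the first row as the header
# (the same role as option("header", "true") in PySpark)
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)  # → [{'id': '1', 'name': 'sachin'}]
```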

What is SparkSession??
SparkSession, introduced in Spark 2.0, is the entry point to underlying PySpark functionality for programmatically creating PySpark RDDs and DataFrames.
Its object, spark, is available by default in the pyspark shell, and it can also be created programmatically using SparkSession.builder.

SparkSession vs SparkContext – 
In earlier versions of Spark/PySpark, SparkContext (JavaSparkContext for Java) was the entry point for programming with RDDs and for connecting to a Spark cluster.
Since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets.
----------------------------------------------------------------------------------------------------------------------------
Video Link: https://www.youtube.com/watch?v=zRUWkzGosqA





Friday, January 28, 2022

Amazon S3 Basic Intro

AWS S3:
- Use Amazon S3 to store and retrieve any amount of data using highly scalable, reliable, fast, and inexpensive data storage 

 - A bucket is a container for objects. 

 - An object is a file and any metadata that describes that file.

Index:
1. Create a bucket
2. Load a data file
3. Download a data file
4. Delete a data file
5. Delete the bucket


Video Link: https://www.youtube.com/watch?v=f9Bxe-3tTNU

What is Athena ?? How its Works ??

Index:

1. Set up Athena
2. Create & select a database
3. Create a table
4. Load data into the table
5. Overview of running queries

Prerequisites:

Familiarity with Hive and S3


Athena is similar to Hive, but it is serverless:
- Athena is a layer on top of Hive
- No server is required
- As with Hive, the input data needs no header row
- In Athena:
- There is no charge for DDL commands (Data Definition Language)
- Charges apply only to data scanned, with a 10 MB minimum per query
- Athena works on external tables

Note:

'skip.header.line.count'='1'   # for the skip header line 
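The header-skip property from the note above goes into the table DDL; a minimal sketch (the table name, columns, and S3 path are illustrative):

```sql
CREATE EXTERNAL TABLE emp (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/emp/'
TBLPROPERTIES ('skip.header.line.count'='1');
```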

Video Link : https://www.youtube.com/watch?v=TzlN-aXKc_w





Sunday, February 28, 2021

Pig Latin word


 Hi Viewer,


Below is my first Python program, which converts a word to Pig Latin.


Code1: 

def pig_ln(word):

    # Words starting with a vowel just get 'ay' appended;
    # otherwise the first letter moves to the end before 'ay'
    first_letter = word[0]

    if first_letter in 'aeiou':
        pigword = word + 'ay'
    else:
        pigword = word[1:] + first_letter + 'ay'

    return pigword
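Exercising the function on a couple of words (repeating the definition so the snippet stands alone):

```python
def pig_ln(word):
    first_letter = word[0]
    if first_letter in 'aeiou':
        pigword = word + 'ay'
    else:
        pigword = word[1:] + first_letter + 'ay'
    return pigword

print(pig_ln('apple'))  # → appleay
print(pig_ln('hello'))  # → ellohay
```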


My Introduction

 Hi Viewers,

I am sharing my knowledge of Data Engineering, the field I work in and am passionate about, here.


So keep supporting   


Thanks,

Sachin K 
