10 Questions you can expect in Spark Interview

Questions frequently asked in Apache Spark Interviews

Hey Fellas,

The Data Engineer role is in high demand these days, and with Apache Spark being the state of the art for batch processing and ETL, being proficient in it can easily land you a job as a Data Engineer. So, in this article I will showcase 10 questions you can expect in an Apache Spark interview. Please note that I won't be including basic questions like "What is a DataFrame?", "What is a Spark RDD?" or "How do you read/write an ORC file?", as I expect that a candidate interviewing for an Apache Spark role already knows these things, and reiterating them is pointless.

With all that said, let's jump into the Q&A.

Is Spark better than Hadoop?

Yes, Spark is evidently better than Hadoop. One of the major reasons is that it is faster, thanks to in-memory processing, which reduces the latency of read/write operations. In the MapReduce paradigm, the output of each task is written to disk, and when that data is needed again it has to be read back from disk. In Spark, processing happens in memory and DataFrames can be cached for future use, which results in much better performance. Moreover, Spark comes with libraries like Spark ML, Spark SQL and Spark Streaming, which make it even richer.
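A minimal sketch of the caching behaviour described above (the input path and the id column are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input; any DataFrame works here.
df = spark.read.parquet("/data/events.parquet")

# Mark the DataFrame for in-memory caching; it is materialised on the
# first action and reused by every action after that.
df.cache()

df.count()                      # first action: reads from disk, fills the cache
df.filter(df.id > 100).count()  # later actions are served from memory
```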

What is the difference between repartition and coalesce?

This is a hot topic of discussion when it comes to optimising your Spark job. Both functions let us change the number of partitions of a DataFrame, but their uses are different. repartition does a full shuffle of the data, so we can either increase or decrease the number of partitions; coalesce just merges data from some partitions into others, which limits us to decreasing the number of partitions. coalesce is faster because it shuffles less, but if the number of partitions has to be increased, or the data is skewed and we want to decrease the number of partitions by reshuffling, then we should go with repartition.
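A quick sketch of both calls (the partition counts are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.range(1_000_000)        # toy dataset for illustration

df8 = df.repartition(8)            # full shuffle: count can go up or down
df2 = df8.coalesce(2)              # merges partitions: decrease only, less shuffling

print(df8.rdd.getNumPartitions())  # 8
print(df2.rdd.getNumPartitions())  # 2
```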

What is a broadcast join?

A broadcast join is another way of optimising a Spark job (joins in particular). When a small DataFrame is joined with a relatively large one, we can broadcast the small DataFrame, which sends a copy of it to every node and results in faster join execution and less shuffling. A sketch of the syntax is given below.
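This sketch assumes a large fact table joined with a small lookup table; the paths and the join key are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Illustrative inputs: a large fact table and a small lookup table.
large_df = spark.read.parquet("/data/transactions.parquet")
small_df = spark.read.parquet("/data/country_codes.parquet")

# broadcast() ships a copy of the small side to every executor, so the
# join runs without shuffling the large side across the cluster.
joined = large_df.join(broadcast(small_df), on="country_code", how="inner")
```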

When broadcasting the smaller DataFrame, we can also reduce its partition count to 1 (for example with coalesce(1)) for better performance, depending on your use-case.

What is the difference between actions and transformations?

There are two important kinds of operations in Apache Spark: transformations and actions. Transformations include functions like filter, where and when; when we call these, Spark does not actually perform them but stacks them up until an action is called. Once an action is called, all the queued transformations are executed at that point, and this lazy evaluation lets Spark optimise the performance of the whole job. Examples of actions are show(), count() and collect().
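A small sketch of this lazy behaviour:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(100)

# Transformations: nothing runs yet, Spark only records a plan.
filtered = df.filter(col("id") > 50)
doubled = filtered.withColumn("double_id", col("id") * 2)

# Action: only now is the whole optimised plan executed.
doubled.count()
```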

A window/partition SQL question

This is a SQL question, but I included it because we can expect it if the interview moves into the window/partition area. Suppose we have a dataset as given below:
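As an illustration (the employee table and the "top earner per department" task are assumptions, since any window/partition exercise fits here):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("window-demo").getOrCreate()

# Hypothetical employee table.
emp = spark.createDataFrame(
    [("Sales", "Ann", 90), ("Sales", "Bob", 70), ("HR", "Carl", 60), ("HR", "Dina", 80)],
    ["dept", "name", "salary"],
)

# Rank rows within each department by salary, then keep the top earner.
w = Window.partitionBy("dept").orderBy(col("salary").desc())
emp.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").show()
```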

How do you read Hive tables from Spark?

The solution is to copy your hive-site.xml and core-site.xml into Spark's conf folder, which gives the Spark job all the required metadata about the Hive metastore. You also have to enable Hive support, and specify the location of your Hive warehouse directory in the configuration while starting your Spark session, as given below:
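A sketch, assuming the config files are already in $SPARK_HOME/conf and using an illustrative warehouse path:

```python
from pyspark.sql import SparkSession

# Assumes hive-site.xml and core-site.xml sit in $SPARK_HOME/conf;
# the warehouse path below is illustrative.
spark = (
    SparkSession.builder
    .appName("hive-demo")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # should now list the Hive databases
```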

How do you read an XML file into a DataFrame?

To read XML, we can use the spark-xml package and specify the schema and the row tag during the read as follows:
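A sketch using the Databricks spark-xml package (the package must be available on the classpath; the file path and the "person" row tag are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("xml-demo").getOrCreate()

# Explicit schema for the fields expected in every record.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("address", StringType(), True),
])

# rowTag names the XML element that holds one row of data.
df = (
    spark.read.format("xml")
    .option("rowTag", "person")
    .schema(schema)
    .load("/data/people.xml")
)
```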

This will give you a DataFrame with "name" and "address" as columns.

So, that’s all folks hope you find my article helpful. Do checkout my previous article on Spark Delta in which I have explained ACID on spark. Till then ta ta!
