Databricks Certified Associate Developer for Apache Spark 3.5 - Python (Associate-Developer-Apache-Spark-3.5) exam questions with solutions:
1. A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine languages like standard Python scripts.
Which two advantages does Apache Spark™ offer over a normal single-machine language in this scenario? (Choose 2 answers)
A) It eliminates the need to write any code, automatically handling all data processing.
B) It processes data solely on disk storage, reducing the need for memory resources.
C) It requires specialized hardware to run, making it unsuitable for commodity hardware clusters.
D) It can distribute data processing tasks across a cluster of machines, enabling horizontal scalability.
E) It has built-in fault tolerance, allowing it to recover seamlessly from node failures during computation.
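The two ideas in options D and E (splitting work across workers, and recovering when one of them fails) can be illustrated with a toy, single-process sketch. This is plain Python with threads standing in for cluster nodes, not Spark itself; the names `split`, `sum_partition`, and `run_with_retry` are hypothetical, and the failure is simulated.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n):
    """Cut the data into n partitions (round-robin)."""
    return [data[i::n] for i in range(n)]

failed_once = set()  # remembers which partitions already had a simulated failure

def sum_partition(idx, part):
    """Process one partition; partition 1 fails on its first attempt."""
    if idx == 1 and idx not in failed_once:
        failed_once.add(idx)
        raise RuntimeError("simulated worker failure")
    return sum(part)

def run_with_retry(idx, part, max_attempts=3):
    """Recompute a failed partition from its input instead of failing the job."""
    for _ in range(max_attempts):
        try:
            return sum_partition(idx, part)
        except RuntimeError:
            continue
    raise RuntimeError(f"partition {idx} failed {max_attempts} times")

data = list(range(1, 101))
partitions = split(data, 4)

# Threads stand in for worker nodes in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(run_with_retry, range(4), partitions))

total = sum(partials)
print(total)
```

Each partition is processed independently, so adding workers spreads the load, and a failed partition only needs to be recomputed from its own input.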
2. An application architect has been investigating Spark Connect as a way to modernize existing Spark applications running in their organization.
Which requirement blocks the adoption of Spark Connect in this organization?
A) Stability: isolation of application code and dependencies from each other and the Spark driver
B) Debuggability: the ability to perform interactive debugging directly from the application code
C) Upgradability: the ability to upgrade the Spark applications independently from the Spark driver itself
D) Complete Spark API support: the ability to migrate all existing code to Spark Connect without modification, including the RDD APIs
3. A data scientist has identified that some records in the user profile table contain null values in one or more fields, and such records should be removed from the dataset before processing. The schema includes fields such as user_id, username, date_of_birth, and created_ts.
The schema of the user profile table looks like this:
Which block of Spark code can be used to achieve this requirement?
Options:
A) filtered_df = users_raw_df.na.drop(thresh=0)
B) filtered_df = users_raw_df.na.drop(how='all', thresh=None)
C) filtered_df = users_raw_df.na.drop(how='any')
D) filtered_df = users_raw_df.na.drop(how='all')
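The difference between the `how` and `thresh` arguments in the options above can be modelled in plain Python. The sketch below is not PySpark; `drop_nulls` is a hypothetical helper that mimics the documented rules (`how='any'` drops a row with at least one null, `how='all'` drops a row only when every value is null, and `thresh`, when given, keeps rows with at least that many non-null values and overrides `how`). The sample rows are made up.

```python
from typing import Optional

def drop_nulls(rows, how: str = "any", thresh: Optional[int] = None):
    """Plain-Python model of the row filter applied by na.drop."""
    kept = []
    for row in rows:
        non_null = sum(value is not None for value in row.values())
        if thresh is not None:
            keep = non_null >= thresh      # thresh takes precedence over how
        elif how == "any":
            keep = non_null == len(row)    # no nulls allowed
        elif how == "all":
            keep = non_null > 0            # drop only fully-null rows
        else:
            raise ValueError(f"unsupported how: {how!r}")
        if keep:
            kept.append(row)
    return kept

# Hypothetical sample data using field names from the question.
users = [
    {"user_id": 1, "username": "ana", "date_of_birth": "1990-01-01"},
    {"user_id": 2, "username": None, "date_of_birth": "1985-05-05"},
    {"user_id": None, "username": None, "date_of_birth": None},
]

any_result = drop_nulls(users, how="any")   # models na.drop(how='any')
all_result = drop_nulls(users, how="all")   # models na.drop(how='all')
thresh0_result = drop_nulls(users, thresh=0)  # models na.drop(thresh=0)

print(len(any_result), len(all_result), len(thresh0_result))
```

Only the `how='any'` variant removes the partially null record, which is the behavior the question asks for; `thresh=0` removes nothing.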
4. A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.
Which change should be made to solve the issue?
A) Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy
B) Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy
C) Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges
D) Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges
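Spark's documentation describes `1.0/accuracy` as the relative error of `approx_percentile`, with a default accuracy of 10000. A small arithmetic sketch shows how the error shrinks as accuracy grows; `relative_error` is a hypothetical helper, and reading the relative error as a slack in row rank is an interpretation assumed here for illustration.

```python
def relative_error(accuracy: int) -> float:
    """Relative error of the approximation for a given accuracy setting."""
    return 1.0 / accuracy

n_rows = 1_000_000  # hypothetical table size

for accuracy in (100, 10_000, 1_000_000):
    err = relative_error(accuracy)
    # Assumed interpretation: the returned value may be off by roughly
    # err * n_rows positions in the sorted data.
    print(f"accuracy={accuracy:>9,}  relative_error={err:.6f}  "
          f"rank_slack~{err * n_rows:,.0f} rows")
```

A higher accuracy value tightens the result at the cost of more memory for the internal summary, which is the trade-off the options describe.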
5. A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:
fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))
The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.
Which change should be made to the code to stop these customer records from being dropped?
A) fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')
B) fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'right_outer')
C) fact_df = cust_df.join(purch_df, F.col('customer_id') == F.col('custid'))
D) fact_df = purch_df.join(cust_df, F.col('cust_id') == F.col('customer_id'))
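Why the default join drops unmatched purchases, and why a left join keeps them, can be seen in a plain-Python model of the two join types. This is a sketch, not PySpark: `join_rows` is a hypothetical helper, the sample rows are made up, and only the `customer_id` and `custid` column names come from the question.

```python
def join_rows(left_rows, right_rows, left_key, right_key, how="inner"):
    """Plain-Python model of an equi-join with 'inner' or 'left' semantics."""
    # Index the right side by its key for fast lookup.
    index = {}
    for right in right_rows:
        index.setdefault(right[right_key], []).append(right)
    right_cols = list(right_rows[0].keys()) if right_rows else []

    out = []
    for left in left_rows:
        key = left[left_key]
        # A null key never matches anything, as in SQL equality.
        matches = index.get(key, []) if key is not None else []
        if matches:
            for right in matches:
                out.append({**left, **right})
        elif how == "left":
            # Keep the unmatched left row and fill right-side columns with nulls.
            out.append({**left, **{col: None for col in right_cols}})
    return out

purchases = [
    {"purchase_id": 10, "customer_id": 1},
    {"purchase_id": 11, "customer_id": 2},
    {"purchase_id": 12, "customer_id": 99},  # no matching customer
]
customers = [
    {"custid": 1, "name": "Ana"},
    {"custid": 2, "name": "Ben"},
]

inner_result = join_rows(purchases, customers, "customer_id", "custid")
left_result = join_rows(purchases, customers, "customer_id", "custid", how="left")

print(len(inner_result), len(left_result))
```

The inner join returns only the two matched purchases, while the left join also returns purchase 12 with null customer columns.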
Questions and answers:
| Question 1 answer: D, E | Question 2 answer: D | Question 3 answer: C | Question 4 answer: B | Question 5 answer: A |





