Verified Databricks-Certified-Professional-Data-Engineer Dumps Q&As - Databricks-Certified-Professional-Data-Engineer Test Engine with Correct Answers [Q50-Q65]

Pass Your Databricks-Certified-Professional-Data-Engineer Dumps as PDF Updated on 2024 With 98 Questions

NEW QUESTION # 50
What is the main difference between the silver layer and the gold layer in the medallion architecture?

A. Silver may contain aggregated data
B. Gold is a copy of silver data
C. Silver is a copy of bronze data
D. Data quality checks are applied in gold
E. Gold may contain aggregated data

Answer: E

Explanation:
Medallion Architecture - Databricks
Silver tables hold cleansed, validated data; gold tables hold business-level data that may be aggregated for reporting.
Exam focus: Understand the role of each layer (bronze, silver, gold) in the medallion architecture; expect varying questions targeting each layer and its purpose.

NEW QUESTION # 51
A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?

A. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch
B. Merge all changes back to the main branch in the remote Git repository and clone the repo again
C. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository
D. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
E. Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9

Answer: D

Explanation:
This is the correct answer because pulling from the remote Git repository refreshes the branches that Repos knows about, which makes dev-2.3.9 available in the branch selection dropdown. Selecting dev-2.3.9 from the dropdown then checks out that branch and displays its contents in the notebook, so the developer can review the current logic without merging anything into their personal branch.
Verified References: [Databricks Certified Data Engineer Professional], under "Databricks Tooling" section; Databricks Documentation, under "Pull changes from a remote repository" section.

NEW QUESTION # 52
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

A. No; Delta Lake manages streaming checkpoints in the transaction log.
B. No; each of the streams needs to have its own checkpoint directory.
C. Yes; Delta Lake supports infinite concurrent writers.
D. Yes; both of the streams can share a single checkpoint directory.
E. No; only one stream can write to a Delta Lake table.

Answer: B

Explanation:
This is the correct answer because checkpointing is a critical feature of Structured Streaming that provides fault tolerance and recovery in case of failures. Checkpointing stores the current state and progress of a streaming query in a reliable storage system, such as DBFS or S3. Each streaming query must have its own checkpoint directory that is unique and exclusive to that query. If two streaming queries share the same checkpoint directory, they will interfere with each other and cause unexpected errors or data loss. Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "Checkpointing" section.
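One way to keep the checkpoints separate can be sketched as follows. This is a minimal sketch, not the layout from the question: the table root `/mnt/bronze/events`, the stream names `topic_a` and `topic_b`, and the helper `checkpoint_path` are all hypothetical. The helper only builds paths; in a real job each path would be passed to that stream's `checkpointLocation` option.

```python
from pathlib import PurePosixPath


def checkpoint_path(table_root: str, stream_name: str) -> str:
    """Return a checkpoint directory that is unique to one streaming query."""
    return str(PurePosixPath(table_root) / "_checkpoints" / stream_name)


# Hypothetical names, for illustration only.
TABLE_ROOT = "/mnt/bronze/events"
STREAMS = ["topic_a", "topic_b"]

checkpoints = {name: checkpoint_path(TABLE_ROOT, name) for name in STREAMS}

# Both streams target the same table, but no two share a checkpoint directory.
assert len(set(checkpoints.values())) == len(STREAMS)

for name, path in checkpoints.items():
    # In PySpark each stream would be started along the lines of:
    #   df.writeStream.format("delta").option("checkpointLocation", path).start(TABLE_ROOT)
    print(f"{name}: {path}")
```

Deriving the path from the stream name keeps the directory structure simple while still giving every query its own checkpoint.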

NEW QUESTION # 53
The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

A. Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.
B. Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.
C. No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.
D. No; the Delta Lake delete command only provides ACID guarantees when combined with the MERGE INTO command.
E. No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

Answer: C

Explanation:
The code uses the DELETE FROM command to delete records from the users table that match a condition based on a join with another table called delete_requests, which contains all users that have requested deletion.
The DELETE FROM command deletes records from a Delta Lake table by creating a new version of the table that does not contain the deleted records. However, this does not guarantee that the records to be deleted are no longer accessible, because Delta Lake supports time travel, which allows querying previous versions of the table using a timestamp or version number. Therefore, files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files from physical storage.
Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Delete from a table" section; Databricks Documentation, under "Remove files no longer referenced by a Delta table" section.
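The interaction between delete, time travel, and vacuum can be illustrated with a toy model. This is only a sketch of copy-on-write versioning in plain Python, not Delta Lake's implementation or API: the class `ToyVersionedTable` and its methods are hypothetical, and the toy `vacuum` ignores the retention threshold that the real VACUUM command honors (7 days by default).

```python
class ToyVersionedTable:
    """Toy copy-on-write table: each version references immutable data files."""

    def __init__(self, rows):
        self.storage = {"file-0": list(rows)}  # physical files
        self.versions = [["file-0"]]           # version number -> referenced files
        self._next_file = 1

    def read(self, version=-1):
        """Read the latest version by default, or an older one (time travel)."""
        return [row for f in self.versions[version] for row in self.storage[f]]

    def delete(self, predicate):
        """Write a new file without the matching rows and commit a new version."""
        kept = [row for row in self.read() if not predicate(row)]
        name = f"file-{self._next_file}"
        self._next_file += 1
        self.storage[name] = kept
        self.versions.append([name])

    def vacuum(self):
        """Physically drop files that the latest version no longer references."""
        live = set(self.versions[-1])
        self.storage = {f: rows for f, rows in self.storage.items() if f in live}


users = ToyVersionedTable([{"user_id": 1}, {"user_id": 2}, {"user_id": 3}])
users.delete(lambda row: row["user_id"] == 2)

latest = users.read()        # current version: user 2 is gone
old = users.read(version=0)  # time travel: user 2 is still readable

users.vacuum()
try:
    users.read(version=0)
    recoverable = True
except KeyError:
    recoverable = False      # the old data file was physically removed
```

The delete only changes which files the newest version points to; the old file, and therefore the deleted record, stays readable until the vacuum step removes it.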

NEW QUESTION # 54
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

A. configure
B. workspace
C. jobs
D. libraries
E. fs

Answer: E

Explanation:
The fs command group interacts with DBFS, including object storage mounted under dbfs:/mnt, so it is the command group that uploads a file. For example, the following command copies a custom Python Wheel named mylib-0.1-py3-none-any.whl from the local machine to a mounted location:
databricks fs cp mylib-0.1-py3-none-any.whl dbfs:/mnt/mylib/mylib-0.1-py3-none-any.whl
The libraries command group does not upload files. It installs, uninstalls, and lists libraries on Databricks clusters, and it can install a wheel that is already in DBFS. For example, with the legacy CLI, the following command installs the uploaded wheel on a cluster with the id 1234-567890-abcde123:
databricks libraries install --cluster-id 1234-567890-abcde123 --whl dbfs:/mnt/mylib/mylib-0.1-py3-none-any.whl
References:
Libraries CLI (legacy): https://docs.databricks.com/en/archive/dev-tools/cli/libraries-cli.html
Library operations: https://docs.databricks.com/en/dev-tools/cli/commands.html#library-operations
Install or update the Databricks CLI: https://docs.databricks.com/en/dev-tools/cli/install.html

NEW QUESTION # 55
The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.
Which approach will ensure that this requirement is met?

A. When data is saved to a table, make sure that a full file path is specified alongside the Delta format.
B. When a database is being created, make sure that the LOCATION keyword is used.
C. When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.
D. When the workspace is being configured, make sure that external cloud object storage has been mounted.
E. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

Answer: C

Explanation:
To create an external or unmanaged Delta Lake table, you need to use the EXTERNAL keyword in the CREATE TABLE statement. This indicates that the table is not managed by the catalog and the data files are not deleted when the table is dropped. You also need to provide a LOCATION clause to specify the path where the data files are stored. For example:
CREATE EXTERNAL TABLE events ( date DATE, eventId STRING, eventType STRING, data STRING) USING DELTA LOCATION '/mnt/delta/events';
This creates an external Delta Lake table named events that references the data files in the '/mnt/delta/events' path. If you drop this table, the data files will remain intact and you can recreate the table with the same statement.
References:
* https://docs.databricks.com/delta/delta-batch.html#create-a-table
* https://docs.databricks.com/delta/delta-batch.html#drop-a-table

NEW QUESTION # 56
Which of the following developer operations in CI/CD can only be implemented through a Git provider when using Databricks Repos?

A. Create and edit code
B. Commit and push code
C. Pull request and review process
D. Create a new branch
E. Trigger Databricks Repos pull API to update the latest version

Answer: C

Explanation:
The answer is the pull request and review process. Note that the question asks for the steps that are implemented in the Git provider, not in Databricks Repos.
See the diagram below to understand the roles that Databricks Repos and the Git provider play when building a CI/CD workflow.
All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are done in a Git provider such as GitHub or Azure DevOps.

NEW QUESTION # 57
How does a Delta Lake differ from a traditional data lake?

A. Delta lake is an open storage format designed to replace flat files with additional capabilities that can provide reliability, security, and performance
B. Delta lake is proprietary software designed by Databricks that can provide reliability, security, and performance
C. Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance
D. Delta lake is a caching layer on top of data lake that can provide reliability, security, and performance
E. Delta lake is a data warehouse service on top of data lake that can provide reliability, security, and performance

Answer: C

Explanation:
The answer is: Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance.
Delta lake is
* Open source
* Builds up on standard data format
* Optimized for cloud object storage
* Built for scalable metadata handling
Delta lake is not
* Proprietary technology
* Storage format
* Storage medium
* Database service or data warehouse

NEW QUESTION # 58
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day.
At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

A. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
B. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
C. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
D. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
E. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

Answer: D

Explanation:
The adjustment that will meet the requirement of processing records in less than 10 seconds is to decrease the trigger interval to 5 seconds. This is because triggering batches more frequently may prevent records from backing up and large batches from causing spill. Spill occurs when the data in memory exceeds the available capacity and has to be written to disk, which slows down processing and increases execution time [1]. By reducing the trigger interval, the streaming query can process smaller batches of data more quickly and avoid spill. This can also improve the latency and throughput of the streaming job [2].
The other options are not correct, because:
* Option A is incorrect because using the trigger once option and configuring a Databricks job to execute the query every 10 seconds does not ensure that all backlogged records are processed with each batch. The trigger once option means that the streaming query will process all the available data in the source and then stop [5]. However, this does not guarantee that the query will finish processing within 10 seconds, especially if there are a lot of records in the source. Moreover, configuring a Databricks job to execute the query every 10 seconds may cause overlapping or missed batches, depending on the execution time of the query.
* Option B is incorrect because the trigger interval can be modified without modifying the checkpoint directory. The checkpoint directory stores the metadata and state of the streaming query, such as the offsets, schema, and configuration [3]. Changing the trigger interval does not affect the state of the streaming query, and does not require a new checkpoint directory. However, changing the number of shuffle partitions may affect the state of the streaming query, and may require a new checkpoint directory [4].
* Option C is incorrect because triggering batches more frequently does not allow idle executors to begin processing the next batch while longer running tasks from previous batches finish. A Structured Streaming query runs its micro-batches sequentially, so the next batch does not start until the previous one has finished.
* Option E is incorrect because increasing the trigger interval to 30 seconds is not a good practice to ensure no records are dropped. Increasing the trigger interval means that the streaming query will process larger batches of data less frequently, which can increase the risk of spill, memory pressure, and timeouts [1][2]. This can also increase the latency and reduce the throughput of the streaming job.
References:
[1] Memory Management Overview
[2] Structured Streaming Performance Tuning Guide
[3] Checkpointing
[4] Recovery Semantics after Changes in a Streaming Query
[5] Triggers
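The effect of the trigger interval on batch size can be shown with simple arithmetic. This is a rough sketch under stated assumptions: a constant, hypothetical arrival rate of 2,000 records per second at peak (the question gives no rate), and each micro-batch picking up exactly the records that arrived during one trigger interval.

```python
def records_per_batch(arrival_rate_per_s: float, trigger_interval_s: float) -> float:
    """Records accumulated between two consecutive triggers at a constant rate."""
    return arrival_rate_per_s * trigger_interval_s


PEAK_RATE = 2_000  # hypothetical records per second, for illustration only

for interval in (10, 5):
    batch = records_per_batch(PEAK_RATE, interval)
    print(f"trigger interval {interval:>2}s -> {batch:,.0f} records per micro-batch")
```

Halving the interval halves the number of records each micro-batch has to hold in memory, which is why smaller batches are less likely to spill.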

NEW QUESTION # 59
You are working on a table called orders which contains data for 2021 and you have the second table called orders_archive which contains data for 2020, you need to combine the data from two tables and there could be a possibility of the same rows between both the tables, you are looking to combine the results from both the tables and eliminate the duplicate rows, which of the following SQL statements helps you accomplish this?

A. SELECT distinct * FROM orders JOIN orders_archive on orders.id = orders_archive.id
B. SELECT * FROM orders_archive MINUS SELECT * FROM orders
C. SELECT * FROM orders UNION SELECT * FROM orders_archive
D. SELECT * FROM orders INTERSECT SELECT * FROM orders_archive
E. SELECT * FROM orders UNION ALL SELECT * FROM orders_archive

Answer: C

Explanation:
The answer is SELECT * FROM orders UNION SELECT * FROM orders_archive
UNION and UNION ALL are set operators.
UNION combines the output from both queries but also eliminates the duplicates.
UNION ALL combines the output from both queries and keeps the duplicates.
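The difference is easy to check on a small example. The sketch below uses Python's built-in sqlite3 module as a stand-in for Databricks SQL, since both operators are standard SQL; the column names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, amount REAL);
    CREATE TABLE orders_archive (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 20.0);
    INSERT INTO orders_archive VALUES (2, 20.0), (3, 30.0);
    """
)

# UNION removes the row that appears in both tables.
union_rows = conn.execute(
    "SELECT * FROM orders UNION SELECT * FROM orders_archive ORDER BY id"
).fetchall()

# UNION ALL keeps every row from both inputs, duplicates included.
union_all_rows = conn.execute(
    "SELECT * FROM orders UNION ALL SELECT * FROM orders_archive ORDER BY id"
).fetchall()

print(union_rows)      # 3 rows
print(union_all_rows)  # 4 rows
conn.close()
```

The order with id 2 exists in both tables, so it appears once in the UNION result and twice in the UNION ALL result.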

NEW QUESTION # 60
The Delta Live Tables Pipeline is configured to run in Development mode using the Triggered Pipeline Mode.
What is the expected outcome after clicking Start to update the pipeline?

A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped
B. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline
C. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional development and testing
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional development and testing

Answer: D

Explanation:
The answer is: All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional development and testing.
A triggered pipeline updates each dataset once and then stops, and in development mode the cluster is kept for reuse rather than terminated.
A DLT pipeline supports two modes, Development and Production. You can switch between the two based on the stage of your development and deployment lifecycle.
Development and production modes
When you run your pipeline in development mode, the Delta Live Tables system:
* Reuses a cluster to avoid the overhead of restarts.
* Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system:
* Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
* Retries execution in the event of specific errors, for example, a failure to start a cluster.
Use the buttons in the Pipelines UI to switch between development and production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution behavior.
Storage locations must be configured as part of pipeline settings and are not affected when switching between modes.
Please review additional DLT concepts using the link below:
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-concepts.html#delta-live-tables-c

NEW QUESTION # 61
What type of table is created when you create a Delta table with the command below?
CREATE TABLE transactions USING DELTA LOCATION "DBFS:/mnt/bronze/transactions"

A. Temp table
B. Managed table
C. Delta Lake table
D. Managed delta table
E. External table

Answer: E

Explanation:
Any time a table is created using the LOCATION keyword, it is considered an external table. The syntax is shown below.
Syntax
CREATE TABLE table_name ( column column_data_type...) USING format LOCATION "dbfs:/"
format -> DELTA, JSON, CSV, PARQUET, TEXT
Running the command from the question creates an external table.

NEW QUESTION # 62
The sales team has asked the data engineering team to develop a dashboard that shows sales performance for all stores, but the sales team would also like to be able to select an individual store location. Which of the following approaches can the data engineering team use to build this functionality into the dashboard?

A. Use Databricks REST API to create a dashboard for each location
B. Use SQL UDF function to filter the data based on the location
C. Use query Parameters which then allow user to choose any location
D. Currently dashboards do not support parameters
E. Use Dynamic views to filter the data based on the location

Answer: C

Explanation:
The answer is: use query parameters, which then allow the user to choose any location.
Databricks supports many types of parameters in the dashboard; a drop-down list can be created based on a query that returns a unique list of store locations.
Here are simple queries that take a parameter:
SELECT * FROM sales WHERE field IN ( {{ Multi Select Parameter }} )
Or
SELECT * FROM sales WHERE field = {{ Single Select Parameter }}
Query parameter types
* Text
* Number
* Dropdown List
* Query Based Dropdown List
* Date and Time

NEW QUESTION # 63
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.
If task A fails during a scheduled run, which statement describes the results of this run?

A. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
B. Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.
C. Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.
D. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.
E. Tasks B and C will be skipped; task A will not commit any changes because of stage failure.

Answer: B

Explanation:
When a Databricks job runs multiple tasks with dependencies, the tasks are executed as a dependency graph. If a task fails, the downstream tasks that depend on it are skipped and marked as Upstream failed. However, the failed task may have already committed some changes to the Lakehouse before the failure occurred, and those changes are not rolled back automatically. Therefore, the job run may result in a partial update of the Lakehouse. Delta Lake's transactional writes make each individual write atomic, but they do not span tasks or an entire job run, so writes that completed before the failure remain committed. You can also use the Run if condition to configure tasks to run even when some or all of their dependencies have failed, allowing your job to recover from failures and continue running.
References:
transactional writes: https://docs.databricks.com/delta/delta-intro.html#transactional-writes
Run if: https://docs.databricks.com/en/workflows/jobs/conditional-tasks.html
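The skip-on-failure behavior described above can be sketched with a toy scheduler. This is plain Python, not the Databricks Jobs API: the function `run_job`, the status strings, and the `committed` list (which stands in for writes that reached the Lakehouse) are all hypothetical.

```python
def run_job(tasks, dependencies):
    """Run tasks in the given order; skip any task whose upstream did not succeed."""
    status = {}
    for name, action in tasks.items():
        if any(status[up] != "succeeded" for up in dependencies.get(name, [])):
            status[name] = "upstream_failed"
            continue
        try:
            action()
            status[name] = "succeeded"
        except Exception:
            status[name] = "failed"
    return status


committed = []  # stands in for writes that reached the Lakehouse


def task_a():
    committed.append("task_a: first write")  # committed before the failure
    raise RuntimeError("task A failed midway")


def task_b():
    committed.append("task_b: write")


def task_c():
    committed.append("task_c: write")


# Tasks are listed in dependency order: B and C each depend on A.
status = run_job(
    {"A": task_a, "B": task_b, "C": task_c},
    {"B": ["A"], "C": ["A"]},
)
print(status)
print(committed)
```

Task A's first write stays in `committed` even though the task failed, while B and C never run. Nothing in the scheduler rolls the earlier write back.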

NEW QUESTION # 64
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.
Which approach will ensure that this requirement is met?

A. When tables are created, make sure that the external keyword is used in the create table statement.
B. Whenever a database is being created, make sure that the location keyword is used.
C. When the workspace is being configured, make sure that external cloud object storage has been mounted.
D. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
E. Whenever a table is being created, make sure that the location keyword is used.

Answer: E

Explanation:
This is the correct answer because the location keyword is what makes a table external. An external table stores its data files outside of the default warehouse directory; Databricks manages the table's metadata but not the underlying files. An external table can be created by using the location keyword to specify the path to a directory in a cloud storage system, such as DBFS or S3. By creating external tables, the data engineering team can avoid losing data if they drop the table, because dropping it removes only the metadata, and they can leverage existing data without moving or copying it.
Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Create an external table" section.

NEW QUESTION # 65
......

The Databricks Databricks-Certified-Professional-Data-Engineer exam is designed to assess the proficiency of the candidates in various areas related to data engineering on Databricks. Databricks-Certified-Professional-Data-Engineer exam focuses on topics such as data ingestion, data transformation, data modeling, data storage, and data processing. Databricks-Certified-Professional-Data-Engineer exam tests the candidates' knowledge of using Databricks to build data pipelines that can handle large volumes of data, process data in real-time, and integrate with other data sources.
