Course ID: | MIST 5757/7757. 3 hours. |
Course Title: | Data Engineering |
Course Description: | The concepts and techniques of architecting data to support data-intensive applications and large-scale data analysis workflows. The course surveys the current landscape of hardware and software for data storage and processing. The course covers architecting scalable data solutions and their deployment into production in the cloud. |
Oasis Title: | Data Engineering |
Nontraditional Format: | Students should be able to demonstrate a working knowledge of the Java programming language. |
Undergraduate Prerequisite: | [MIST 4610 or MIST 4610E or MIST 7600 with a minimum grade of C (2.0)] and [MIST 4600 or MIST 4600E with a minimum grade of C (2.0)] |
Graduate Prerequisite: | [MIST 4610 or MIST 4610E or MIST 7600 with a minimum grade of C (2.0)] and [MIST 4600 or MIST 4600E with a minimum grade of C (2.0)] |
Semester Course Offered: | Offered spring semester every year. |
Grading System: | A-F (Traditional) |
|
Course Objectives: | Concepts
Infrastructure for data-intensive applications (storage, computation, and networking)
Data storage and retrieval with modern large-scale data stores (databases, data lakes, data streams)
Engineering data processing pipelines using functional programming concepts (map, reduce, filter)
Servicing data lakes and data streams for analytics and transactional operations
Techniques
Functional programming with Java (lambda expressions, streams, and parallel processing)
Distributed data processing using Apache Spark (ingesting, streaming, transforming, and storing data)
Using the cloud environment (AWS) for deploying data solutions
Using SQL to extend traditional database skills for big data analytics |
Topical Outline: | Module 1: Architecture and infrastructure
The infrastructure for data-intensive applications
Parallel and distributed computing architectures
The software landscape for big data
Module 2: Functional programming and the map-reduce paradigm
Functional programming in Java
Streams and functional operations (map, reduce, filter)
Parallelism and concurrency models
Module 3: Scaling out, distributed computing on computer clusters
Apache Spark, the unified analytics engine for distributed big data processing
Data ingestion and storage from/to files and data stores
Implementing data pipelines with Spark Resilient Distributed Datasets (RDDs), Spark DataFrames, and Spark SQL
Module 4: Deployment into production
Batching, queuing, and stream processing
The lambda architecture on the cloud
Deploying Spark on Amazon Web Services (AWS) Elastic MapReduce (EMR) |
Honor Code Reference: | As a University of Georgia student, you have agreed to abide by the University's academic honesty policy, "A Culture of Honesty," and the Student Honor Code. All academic work must meet the standards described in "A Culture of Honesty." Lack of knowledge of the academic honesty policy is not a reasonable explanation for a violation. Questions related to course assignments and the academic honesty policy should be directed to the instructor. |