UGA Bulletin

Course Description

The concepts and techniques of architecting data to support data-intensive applications and large-scale data analysis workflows. The course surveys the current landscape of hardware and software for data storage and processing. The course covers architecting scalable data solutions and their deployment into production in the cloud.

Additional Requirements for Graduate Students:
In addition to the undergraduate requirements, graduate students will complete an extra project deliverable to deploy a data solution in the cloud.

Athena Title

Data Engineering

Non-Traditional Format

Students should be able to demonstrate a working knowledge of the Java programming language.

Undergraduate Prerequisite

[MIST 4610 or MIST 4610E or MIST 7600 with a minimum grade of C (2.0)] and [MIST 4600 or MIST 4600E with a minimum grade of C (2.0)]

Graduate Prerequisite

[MIST 4610 or MIST 4610E or MIST 7600 with a minimum grade of C (2.0)] and [MIST 4600 or MIST 4600E with a minimum grade of C (2.0)]

Semester Course Offered

Offered spring

Grading System

A - F (Traditional)

Course Objectives

Concepts:
Infrastructure for data-intensive applications (storage, computation, and networking)
Data storage and retrieval with modern large-scale data stores (databases, data lakes, data streams)
Engineering data processing pipelines using functional programming concepts (map, reduce, filter)
Servicing data lakes and data streams for analytics and transactional operation

Techniques:
Functional programming with Java (Lambda functions, streams, and parallel processing)
Distributed data processing using Apache Spark (ingesting, streaming, transforming, and storing data)
Using the cloud environment (AWS) for deploying data solutions
Using SQL to extend traditional database skills for big data analytics
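
As an illustration of the map, reduce, and filter operations named in the objectives, the following is a minimal sketch in Java, assuming Java 9 or later. The class name, method names, and sample values are hypothetical and are not taken from the course materials. It shows one small pipeline written twice, once as a sequential stream and once as a parallel stream.

```java
import java.util.List;

// Hypothetical example class; not part of the course materials.
public class StreamPipelineSketch {

    // filter keeps the even values, map squares each one, reduce sums them.
    static int sumOfEvenSquares(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 == 0)
                .map(v -> v * v)
                .reduce(0, Integer::sum);
    }

    // Same pipeline on a parallel stream. The result is the same because
    // integer addition is associative and 0 is its identity value.
    static int sumOfEvenSquaresParallel(List<Integer> values) {
        return values.parallelStream()
                .filter(v -> v % 2 == 0)
                .map(v -> v * v)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> sample = List.of(1, 2, 3, 4, 5, 6); // made-up sample data
        System.out.println(sumOfEvenSquares(sample));         // prints 56
        System.out.println(sumOfEvenSquaresParallel(sample)); // prints 56
    }
}
```

The two methods differ only in the call that creates the stream, which is one reason the functional style is commonly used as a stepping stone toward parallel and distributed processing.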

Topical Outline

Module 1: Architecture and infrastructure
The infrastructure for intensive data applications
Parallel and distributed computing architectures
The software landscape for big data

Module 2: Functional programming and the map-reduce paradigm
Functional programming in Java
Streams and functional operations (map, reduce, filter)
Parallelism and concurrency models

Module 3: Scaling-out, distributed computing on computer clusters
Apache Spark, the unified analytics engine for distributed big data processing
Data ingestion and storage from/to files and data stores
Implementing data pipelines with Spark
Resilient Distributed Datasets (RDDs), Spark Data Frames, and Spark SQL

Module 4: Deployment into production
Batching, queuing, and stream processing
The lambda architecture on the cloud
Deploying Spark on Amazon Web Services (AWS) Elastic MapReduce (EMR)
