Data Manipulation at Scale: Systems and Algorithms

Start date: 08/01/2020    Duration: Unknown

Platform: Coursera

Course category: Other

University or institution: CourseraNew





Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making: we are drowning in data. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.

In this course, you will learn the landscape of relevant systems, the principles on which they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how practical systems were derived from the frontier of research in computer science and what systems are on the horizon. The course covers cloud computing, SQL and NoSQL databases, MapReduce and the ecosystem it spawned, Spark and its contemporaries, and specialized systems for graphs and arrays. You will also learn the history and context of data science, the skills, challenges, and methodologies the term implies, and how to structure a data science project.

Learning Goals: At the end of this course, you will be able to:

1. Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.
2. Identify and use the programming models associated with scalable data manipulation, including relational algebra, MapReduce, and other data flow models.
3. Use database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics.
4. Evaluate key-value stores and NoSQL systems, and describe their tradeoffs with comparable systems, the details of important examples in the space, and future trends.
5. “Think” in MapReduce to write algorithms effectively for systems including Hadoop and Spark, and write programs in Spark. You will understand these systems' limitations, design details, relationship to databases, and associated ecosystem of algorithms, extensions, and languages.
6. Describe the landscape of specialized Big Data systems for graphs, arrays, and streams.



Understand the terminology and recurring principles associated with data science, and understand the structure of data science projects and emerging methodologies to approach them. Why does this emerging field exist? How does it relate to other fields? How does this course distinguish itself? What do data science projects look like, and how should they be approached? What are some examples of data science projects?




