Fundamentals of Scalable Data Science

Start date: 08/29/2020 Duration: Unknown

Platform: Coursera

Course category: Other

University or institution: CourseraNew







Apache Spark is the de facto standard for large-scale data processing. This is the first course in a series leading to the IBM Advanced Data Science Specialization. We strongly believe that it is crucial for success to start by learning a scalable data science platform, since memory and CPU constraints are the most limiting factors when it comes to building advanced machine learning models.

In this course we teach you the fundamentals of Apache Spark using Python and PySpark. We'll introduce Apache Spark in the first two weeks and learn how to apply it to basic exploratory and data pre-processing tasks in the last two weeks. Through this exercise you'll also be introduced to the most fundamental statistical measures and data visualization technologies.

This gives you enough knowledge to take on the role of a data engineer in any modern environment. It also gives you the basis for advancing your career towards data science.

Please have a look at the full specialization curriculum:

If you choose to take this course and earn the Coursera course certificate, you will also earn an IBM digital badge. To find out more about IBM digital badges follow the link

After completing this course, you will be able to:
• Describe how basic statistical measures are used to reveal patterns within the data
• Recognize data characteristics, patterns, trends, deviations or inconsistencies, and potential outliers
• Identify useful techniques for working with big data, such as dimension reduction and feature selection methods
• Use advanced tools and charting libraries to:
      o Improve the efficiency of big-data analysis with partitioning and parallel analysis
      o Visualize the data in a number of 2D and 3D formats (Box Plot, Run Chart, Scatter Plot, Pareto Chart, and Multidimensional Scaling)

For successful completion of the course, the following prerequisites are recommended:
• Basic programming skills in Python
• Basic math
• Basic SQL (you can get it easily from if needed)

In order to complete this course, the following technologies will be used: (These technologies are introduced in the course as necessary, so no previous knowledge is required.)
• Jupyter notebooks (brought to you by IBM Watson Studio for free)
• Apache Spark (brought to you by IBM Watson Studio for free)
• Python

This course takes four weeks, 4-6 hours per week.



Data analysis starts with a hypothesis, and through exploration that hypothesis is tested. Exploratory analysis in IoT considers large amounts of data, past or current, from multiple sources and summarizes its main characteristics. Data is strategically inspected and cleaned, and models are created with the purpose of gaining insight, predicting future data, and supporting decision making. This learning module introduces methods for turning raw IoT data into insight.
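
As a rough illustration of what "summarizing the main characteristics" of sensor data can look like, here is a minimal sketch in plain Python (standard library only, not Spark). The function name `summarize`, the sample temperature readings, and the 1.5 x IQR outlier rule are assumptions made for this example and are not taken from the course material.

```python
import statistics


def summarize(readings):
    """Return a few basic statistical measures for a list of numeric readings."""
    # Quartile cut points; quantiles(n=4) returns [Q1, Q2, Q3].
    q1, _, q3 = statistics.quantiles(readings, n=4)
    iqr = q3 - q1
    # Assumed rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(readings),
        "median": statistics.median(readings),
        "stdev": statistics.stdev(readings),
        "outliers": [x for x in readings if x < low or x > high],
    }


# Hypothetical temperature readings; the last value simulates a sensor glitch.
temperatures = [21.0, 21.5, 22.0, 21.8, 22.1, 21.7, 21.9, 35.0]
summary = summarize(temperatures)
print(summary)
```

With these made-up readings, only the 35.0 value falls outside the computed bounds, so it is the one reading flagged for inspection before any modeling step. On data too large for one machine, the same measures would be computed with a distributed engine such as Spark rather than in-memory lists.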



The value of IoT lies in the analysis of data gathered from the system under observation.