1 Description

  • This course is part of the second semester of M1 DS2E/SE. It aims to give students fundamental programming concepts, including data structures, libraries, reusable functions, and efficient code.

 

  • Each session is divided into two parts. In the first part, we go through the course material using this site as support. In the second part, students solve exercises and present their approaches to the others. The aim is to get students used to presenting and explaining their code. More importantly, it lets them compare their different approaches and so better understand how to improve their code.

 

This course is structured as follows:

  • Chapter ‘Basics’ goes through the simplest operations we can perform. It covers the different types of objects present in the two languages (R and Python), the use of control flow, and the creation of functions.

     

  • Chapter ‘Arrays/Vectors’ introduces the use of vectors and matrices to perform operations. It highlights the benefits of vectorization and briefly discusses how to deal with sparse matrices.

     

  • Chapter ‘Regex’ gives an overview of regular expressions and string manipulation. It brings together the first two chapters by applying regular expressions in a very basic web scraping exercise.

     

  • Chapter ‘Data Analysis’ deals with storing and manipulating data: in R we will use tibbles and data.table, while in Python we will focus on pandas. The chapter covers basic operations on data frames such as slicing, filtering, grouping, and merging, as well as reading and writing data.

     

  • Chapter ‘Best Practices’ covers coding conventions that make code easier for others to read.

     

  • Chapter ‘Efficient Programming’ is an introduction to parallel computing and discusses its benefits and limitations.
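
To give a flavour of the kind of exercise the ‘Regex’ chapter describes, here is a minimal Python sketch that extracts links from a page with a regular expression. It is only an illustration, not the course's actual exercise: the HTML snippet, the example.org URLs, and the names html, link_pattern, and links are all assumptions made for this example. Real pages are rarely this regular, so a pattern like this one only suits very simple markup.

```python
import re

# Hypothetical HTML snippet standing in for a downloaded web page.
html = """
<ul>
  <li><a href="https://example.org/data.csv">Dataset</a></li>
  <li><a href="https://example.org/notes.html">Notes</a></li>
</ul>
"""

# Capture the href value and the link text of every simple anchor tag.
link_pattern = re.compile(r'<a\s+href="([^"]+)">([^<]+)</a>')

# findall returns one (url, label) tuple per match, in document order.
links = link_pattern.findall(html)

for url, label in links:
    print(f"{label}: {url}")
```

Because the pattern contains two capture groups, findall returns a list of tuples, which can then be stored in a list or a data frame for further processing.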

 

2 Exams

There are two graded components in this course:

  • A project to be handed in by groups of 2 to 3 students. The topic is entirely open, but here are some guidelines:
    • The project must involve data processing. After defining the question you want to address, select the type of data you will use (cross-sectional data, emails, time series, articles, websites, maps, simulated databases, Kaggle, UCI …).
    • The goal of the project is to create a tool that answers your question. This tool can be almost anything: there must be something you have always dreamed of automating.
    • The tool must be documented and usable through the GitHub platform. The presentation slides must be written in Markdown.

       

  • An oral presentation of the project, in which students discuss the issues raised by the project. They will present the strong and weak points of their code and the different approaches they used to address them while building their tool.

4 Materials used in this course

Hadley Wickham and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data

Zed A. Shaw. Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code

Luciano Ramalho. Fluent Python

Colin Gillespie and Robin Lovelace. Efficient R Programming

Wes McKinney. Python for Data Analysis

https://www.anotherbookondatascience.com/

https://www.business-science.io/business/2018/10/08/python-and-r.html

https://www.practicaldatascience.org/html/vars_v_objects.html

https://learnanalyticshere.wordpress.com/2015/05/14/clash-of-the-titans-r-vs-python/

https://www.statmethods.net/input/datatypes.html

https://thomas-cokelaer.info/tutorials/python/lists.html

https://www.python.org/dev/peps/pep-0008/#introduction

https://www.datacamp.com/community/tutorials/r-tutorial-apply-family#as

https://towardsdatascience.com/the-ultimate-beginners-guide-to-numpy-f5a2f99aef54

https://towardsdatascience.com/getting-started-with-git-and-github-6fcd0f2d4ac6

https://docs.oracle.com/javase/tutorial/java/data/characters.html

https://www.tutorialspoint.com/python/python_reg_expressions.html

https://www.w3schools.com/python/python_ref_string.asp

https://thomas-cokelaer.info/tutorials/python/strings.html

http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_R_Python_Data_Perfs.pdf

https://juba.github.io/tidyverse/06-tidyverse.html

https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

http://python-simple.com/python-pandas/panda-intro.php

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html

https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2

To be completed.