1 Description

  • This course is part of the second semester of M1 DS2E/SE. It aims to teach students fundamental programming concepts, including data structures, libraries, reusable functions, and efficient code. Python and R will be used, as both have become very popular for data science in academia and the private sector. The course will also cover how to report the results of a data analysis using Markdown and how to host source code (GitHub).

 

  • Each session will be divided into two parts. In the first part, we will go through the course material using this site as a support. In the second part, students will be asked to solve exercises and present their approaches to others. The aim is to familiarize students with presenting and explaining their code and, even more importantly, to let them compare their different approaches in order to better understand how to improve their code.

 

This lecture is structured as follows (each section is covered with R and Python):

  • Chapter ‘Basics’ will go through the simplest operations we can perform. It covers the different types of objects present in the two languages, the use of control flow, and the creation of functions.

     

  • Chapter ‘Arrays/Vectors’ introduces the use of vectors and matrices to perform operations. It highlights the benefits of vectorization and briefly discusses how to deal with sparse matrices.

     

  • Chapter ‘Data Analysis’ covers how to store and manipulate data: in R we will use tibbles and data.table, while in Python we will focus on pandas. Basic operations on data frames, such as slicing, filtering, grouping, and merging, are covered in this chapter, as well as reading and writing data.

     

  • Chapter ‘Best Practices’ is about following coding conventions so that others can read your code more easily. It will also explain how to share your code and make it accessible via GitHub, and how to produce reports or feed a blog with RMarkdown.

 

2 Exams

There are three graded components in this course:

  • A written exam at the end of the semester (16/12/2025).

     

  • A project to be handed in by groups of 2-3 (13/01/2026). The project is open-ended, but here is a guideline:

    Project Overview: Personal Data Analysis Tool

    Create a tool that analyzes your own data extracted from an online platform using GDPR data export rights. The tool should be reusable and updateable when new data becomes available.

    Recommended Platforms:

    • Spotify: Very structured music data, official API available for updates. Analyze your listening habits, favorite genres, discovery patterns.
    • Google Takeout: Extremely rich dataset (YouTube history, Maps locations, Chrome browsing, Gmail, etc.). Multiple analysis possibilities.
    • Instagram/Facebook: Social interactions, posting patterns, engagement metrics, network analysis.
    • Amazon: Purchase history, spending patterns, product categories, temporal trends.
    • LinkedIn: Professional journey, network growth, job search activities, engagement with content.

    Key Requirements

    • GDPR Request: Submit your data export request at the start of the project (platforms have up to 30 days to respond, but typically take 2-3 days).
    • Personal Focus: Choose what you want to study about YOUR OWN usage of the platform.
    • Generic Tool: Build a reusable tool that could analyze anyone’s export from that platform, not just yours.
    • Update Mechanism: Implement a system to refresh the analysis when new data is added (either via API or by detecting new export files).

    Critical Reminders: Data Privacy

    • Your personal data is sensitive and private. Never share it with anyone, and never commit it to a Git repository.
    • Use .gitignore files properly.

 

  • The project must be presented orally. Students will discuss the issues raised in the project and present the strong and weak points of their code, along with the different approaches they used to address these issues while building their tools.
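
The update mechanism mentioned in the project guideline (detecting new export files) can be approached in several ways. The following is only a minimal sketch, assuming that exports are JSON files dropped into a single folder. The function names, the `*.json` pattern, and the manifest file are hypothetical choices made for illustration, not requirements of the project.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path: Path) -> dict:
    """Load the {file name: digest} record of already processed exports."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def find_new_exports(export_dir: Path, manifest_path: Path) -> list:
    """List export files that are new or whose content has changed."""
    seen = load_manifest(manifest_path)
    # Assumption: exports are stored as *.json files directly in export_dir.
    return [
        path
        for path in sorted(export_dir.glob("*.json"))
        if seen.get(path.name) != file_digest(path)
    ]


def mark_processed(paths: list, manifest_path: Path) -> None:
    """Record the given files as processed so they are skipped next time."""
    seen = load_manifest(manifest_path)
    for path in paths:
        seen[path.name] = file_digest(path)
    manifest_path.write_text(json.dumps(seen, indent=2), encoding="utf-8")
```

A typical run would call `find_new_exports`, feed the returned files to the analysis, and then call `mark_processed` on them. Comparing content hashes rather than file names means a re-downloaded export that replaces an older file is also picked up. In line with the data privacy reminders, the export folder itself should stay out of the Git repository.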

4 Materials used in this course

Hadley Wickham. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data

Zed A. Shaw. Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code

Luciano Ramalho. Fluent Python

Colin Gillespie and Robin Lovelace. Efficient R Programming

Wes McKinney. Python for Data Analysis

https://www.anotherbookondatascience.com/

https://www.business-science.io/business/2018/10/08/python-and-r.html

https://www.practicaldatascience.org/html/vars_v_objects.html

https://learnanalyticshere.wordpress.com/2015/05/14/clash-of-the-titans-r-vs-python/

https://www.statmethods.net/input/datatypes.html

https://thomas-cokelaer.info/tutorials/python/lists.html

https://www.python.org/dev/peps/pep-0008/#introduction

https://www.datacamp.com/community/tutorials/r-tutorial-apply-family#as

https://towardsdatascience.com/the-ultimate-beginners-guide-to-numpy-f5a2f99aef54

https://towardsdatascience.com/getting-started-with-git-and-github-6fcd0f2d4ac6

https://docs.oracle.com/javase/tutorial/java/data/characters.html

https://www.tutorialspoint.com/python/python_reg_expressions.html

https://www.w3schools.com/python/python_ref_string.asp

https://thomas-cokelaer.info/tutorials/python/strings.html

http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_R_Python_Data_Perfs.pdf

https://juba.github.io/tidyverse/06-tidyverse.html

https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

http://python-simple.com/python-pandas/panda-intro.php

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html

https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2

To be completed.