Chapter 2 Software packages
In general, a Data Scientist role at DairyNZ involves extracting, manipulating, transforming and pre-processing data, and generating predictions from it. This work requires various statistical tools, programming languages, access to databases and so on.
This chapter lists the software packages that Data Scientists commonly need to carry out their data operations.
NOTE: You can get required software packages installed on your computer with an AssistMe support request from Moogle.
2.1 Recommended packages
2.1.1 RStudio and R
R and RStudio are widely used for econometric analysis, statistical analysis, machine learning, web-app development and even documentation (like this guidebook).
We recommend installing the latest distribution of R from CRAN.
Download RStudio, a commonly used integrated development environment (IDE) for the R programming language, from the RStudio website.
Recommendations: get the latest versions of these software packages, preferably the 64-bit versions, where possible.
2.1.2 Rtools
Rtools is a set of build tools (compilers and command-line utilities) for R and RStudio on Windows. It is mainly needed for building R packages from source, for example packages that contain C, C++ or Fortran code.
https://cran.r-project.org/bin/windows/Rtools/rtools40.html
Recommendations: get the latest version of Rtools, preferably the 64-bit version, where possible.
2.1.3 Java
A Java Development Kit (JDK) from Oracle (preferred) or a Java Runtime Environment (JRE) is required for commonly used R packages such as rJava to function. rJava in turn provides the basis for other R packages such as xlsx, glmulti and xlsxjars.
On Windows, you may have to set the JAVA_HOME environment variable so that it points to your Java installation folder. Here is a guide on how to add or modify an environment variable on Windows.
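As a rough illustration of what a correctly set JAVA_HOME looks like, the Python sketch below checks that the variable is set and that its bin folder contains a java executable. It assumes Python is available (for example through Anaconda, see 2.2.2) and that Java uses the standard JDK/JRE folder layout. The function name check_java_home is made up for this example, and the check does not guarantee that rJava itself will load.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable.

    `env` is a mapping of environment variables; it defaults to the
    current process environment. Hypothetical helper for illustration.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"

    bin_dir = Path(java_home) / "bin"
    # The executable is java.exe on Windows and java on other systems.
    if any((bin_dir / name).is_file() for name in ("java.exe", "java")):
        return True, f"JAVA_HOME points to a Java installation: {java_home}"
    return False, f"JAVA_HOME is set to {java_home}, but no java executable was found in {bin_dir}"


if __name__ == "__main__":
    ok, message = check_java_home()
    print(message)
```

If the message reports a problem, update JAVA_HOME, then restart R or RStudio so that the new value is picked up.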
2.1.4 Git
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency (description taken from the Git project website).
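One quick way to confirm that Git (or another command-line tool from this chapter) is installed and reachable from the PATH is sketched below in Python. The helper name tool_version is hypothetical, and the sketch assumes the tool accepts a --version flag, which Git does.

```python
import shutil
import subprocess


def tool_version(command, version_flag="--version"):
    """Return the first line of the tool's version output.

    Returns None if the command is not found on the PATH.
    Hypothetical helper for illustration only.
    """
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run(
        [path, version_flag],
        capture_output=True,
        text=True,
        timeout=30,
    )
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    version = tool_version("git")
    if version is None:
        print("git was not found on the PATH")
    else:
        print(version)
```

If Git is reported as missing, it can be requested through AssistMe like the other packages in this chapter.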
2.1.5 Notepad++ (for Windows users)
It is always good to have a supplementary general-purpose text and source code editor on Microsoft Windows. Notepad++ supports almost all major programming languages, offers tabbed editing and allows working with multiple open files in a single window. It is fast and is distributed as free software.
2.2 Optional packages
2.2.1 Microsoft Visual Studio
Microsoft Visual Studio is another general-purpose IDE from Microsoft. It supports several programming languages and is commonly used to develop computer programs, websites, web apps, web services and mobile apps.
2.2.2 Anaconda
Anaconda is a powerful distribution for data scientists who use Python. It includes a Python installation along with several useful tools such as Jupyter Notebook and Spyder.
2.3 Access to services
In addition to the above software packages, you will also need access to the following online services to make full use of DairyNZ’s Data Science infrastructure.
RStudio Connect - publisher access
RStudio Workbench - user license
Snowflake
GitHub: DairyNZ company account
NOTE: Access to these services can be requested via an AssistMe support ticket for Digital Services.
Please feel free to add any further recommendations!