Classification of programming languages via IBM Watson Studio and Jupyter Notebook

We will use a Jupyter notebook together with the IBM Watson Natural Language Classifier to build a model that predicts the programming language a piece of code is written in. Such a model is useful for data scientists who need to identify the language of arbitrary code.

Jupyter Notebook

The Jupyter Notebook is part of Project Jupyter. It is an open source web application for creating and sharing documents that contain live code, equations, and visualizations. It is used for statistical analysis, data modeling, data cleaning, numerical simulation, machine learning, and more. Jupyter Notebooks support over 40 programming languages and are easy to share through GitHub, Dropbox, and similar services.

IBM Watson Natural Language Classifier

The Natural Language Classifier uses machine learning to identify and classify text. The text can be analyzed, labeled, and organized into custom categories that the user defines. The main aim is to provide natural language processing (NLP), and text classification is at the heart of NLP. The Watson Natural Language Classifier offers scalable text classification that lets you automate workflows, extract actionable insights, and improve processes over time. Its machine learning approach is intended to improve accuracy, and it supports multiple languages.

Use Case

In this scenario, a Jupyter Notebook running in IBM Watson Studio is used to build a model that identifies the programming language of a piece of code. This is done through text classification and analysis of the given code. The model can then be evaluated using the Watson Natural Language Classifier.

This can be used in a variety of ways, even to identify short snippets of code. Data scientists can use it to scan GitHub contents and identify the languages used. This allows easy extraction of structured and unstructured data, which can then be analyzed to identify patterns and gain insights. It can also be used to identify which languages developers prefer for different types of applications.

Development and Process Description

With the IBM Watson Natural Language Classifier, data scientists can build a model that classifies documents by their text and then evaluate the data and results. The classification is based on custom categories for which the user defines specifications and parameters. With a Jupyter Notebook running on IBM Watson Studio, the data can be cleaned, structured, extracted, and manipulated. The Watson Developer Cloud SDK for Python provides APIs that can be used to create models in IBM Watson Natural Language Classifier.

You first need access to IBM Watson Studio. Then, create an IBM Watson Studio workspace. Next, create the Jupyter notebook and the Watson Natural Language Classifier instance. For this code pattern, you first build a labeled data set. Then, you use the IBM Watson Natural Language Classifier to build a predictive model, and you also build a predictive model within the Jupyter Notebook. The final step is to configure and use the APIs from the Watson Developer Cloud SDK.

Process Flow

In this Code Pattern, we will use Jupyter Notebooks in IBM Watson Studio to build a model that predicts the programming language of a piece of code based on its text. The model will then be evaluated using IBM’s Watson Natural Language Classifier.

When the reader has completed this Code Pattern, they will understand how to:

  •      Build a labeled data set.
  •      Use Watson Natural Language Classifier to create a predictive model.
  •      Build a predictive model within a Jupyter Notebook.
  •      Configure and use Watson APIs.
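
As an illustration of the first goal, here is a minimal sketch of how a labeled data set might be assembled. It assumes a two-column "text,label" CSV layout and labels each sample by file extension. The names EXTENSION_LABELS, label_for, build_rows, and to_csv are hypothetical and do not come from the notebook.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical extension-to-label map; extend it for the languages you collect.
EXTENSION_LABELS = {
    ".py": "python",
    ".js": "javascript",
    ".java": "java",
    ".go": "go",
    ".rb": "ruby",
}


def label_for(filename):
    """Return the language label for a file name, or None if unknown."""
    return EXTENSION_LABELS.get(PurePosixPath(filename).suffix.lower())


def build_rows(files):
    """Turn (filename, source_text) pairs into (text, label) rows."""
    rows = []
    for filename, text in files:
        label = label_for(filename)
        if label and text.strip():
            # Collapse whitespace so every sample fits on one CSV line.
            rows.append((" ".join(text.split()), label))
    return rows


def to_csv(rows):
    """Serialize rows as 'text,label' CSV, one sample per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample files, for illustration only.
sample_files = [
    ("app/main.py", "import os\nprint(os.getcwd())"),
    ("web/index.js", "const x = 1;\nconsole.log(x);"),
    ("README.md", "# docs"),  # skipped: no label for this extension
]
rows = build_rows(sample_files)
print(to_csv(rows))
```

Labeling by extension is only one possible choice. It is cheap to apply to repository contents, but it will mislabel files whose extension does not match their content.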

Flow

  1. The developer creates an IBM Watson Studio Workspace.
  2. Using Watson Studio, the developer creates a Jupyter notebook and a Watson Natural Language Classifier instance.
  3. The user can create a new dataset from GitHub, or use the existing one in this repo.
  4. The user interacts with the notebook to build a Naive Bayes classifier and a Natural Language Classifier instance using the Watson Developer Cloud SDK.
  5. The notebook Python code can use NLC APIs to create and use a classifier.
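
To make the Naive Bayes step in the flow above more concrete, here is a small from-scratch sketch of a multinomial Naive Bayes classifier over code tokens, with add-one smoothing. It is a stand-in for illustration only and is not the notebook's implementation. The tokenizer, the NaiveBayes class, and the four training samples are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

# Identifier-like words, or any single punctuation character.
TOKEN = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(code):
    """Split code into words and single punctuation marks (digits are dropped)."""
    return TOKEN.findall(code)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        """Count tokens per label from an iterable of (text, label) pairs."""
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in samples:
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n / total)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny made-up training set; a real one needs many samples per language.
training = [
    ("def add(a, b):\n    return a + b", "python"),
    ("import sys\nprint(sys.argv)", "python"),
    ("function add(a, b) { return a + b; }", "javascript"),
    ("const x = 1; console.log(x);", "javascript"),
]
model = NaiveBayes().fit(training)
print(model.predict("def main():\n    print('hi')"))
print(model.predict("function f() { console.log(1); }"))
```

Keeping punctuation as tokens is a deliberate choice here, because characters such as braces, semicolons, and colons are often strong hints about the language.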

Included components

Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.

Jupyter Notebook: An open source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.

Watson Natural Language Classifier: Understand the intent behind text passages through custom classifiers, complete with a confidence score.

Featured technologies

Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.

Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.

Steps

  1. Sign up for Watson Studio
  2. Create a project and add services
  3. Create a notebook in Watson Studio
  4. Run the notebook in Watson Studio
  5. Add or change data set

  1. Sign up for Watson Studio

Sign up for IBM’s Watson Studio. By creating a project in Watson Studio, a free tier Object Storage service will be created in your IBM Cloud account. Take note of your service names as you will need to select them in the following steps.

Note: When creating your Object Storage service, select the Free storage type in order to avoid having to pay an upgrade fee.

  2. Create a project and add services

In Watson Studio, create a new project that will contain the notebook and connections to the IBM Cloud services. Choose the Data Science project tile.

Associate the project with a Natural Language Classifier service instance. Go to the Settings tab in the new Project and scroll down to Associated Services. Click + and select Watson from the drop-down menu. Select an existing Watson Natural Language Classifier service or create a new one for free.

Once your Natural Language Classifier (NLC) service is created, copy the credentials and save them for later, when you will use them in your Jupyter notebook.

  3. Create a notebook in Watson Studio

In the Assets tab of the new project, select Notebooks -> + New notebook OR select + Add to the project -> Notebook.

Select the From URL tab.

Enter a name for the notebook.

Optionally, enter a description for the notebook.

Under Notebook URL, provide the following URL: https://raw.githubusercontent.com/IBM/programming-language-classifier/master/notebooks/buildmodels.ipynb

Click the Create button.

  4. Run the notebook in Watson Studio

Place your cursor in the first code block in the notebook.

Click on the Run icon to run the code in the cell.

Move your cursor to each code cell and run the code in it. Read the comments for each cell to understand what the code is doing.

Important: when the code in a cell is still running, the label to the left changes to In [*]:.

Do not continue to the next cell until the code is finished running, and the [*] has changed to a number.

When you get to the cell that says ## 3.0 Create Classifier with Watson NLC and Evaluate Classification Accuracy, insert the username and password that you saved from your Watson Natural Language Classifier instance into the code before running it.

When you get to the cell that says 3.2 Add Classifier ID, add the classifier_id that appears in the output of 3.1 Create Classifier.

Continue running each cell until you finish the entire notebook.
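
The accuracy evaluation mentioned in section 3.0 can be pictured with a short sketch. It assumes each classification result is a dictionary with a top_class field and a classes list of class_name and confidence entries. The responses below are hand-written stand-ins rather than real service output, and the helpers top_class and accuracy are hypothetical names.

```python
def top_class(response):
    """Return the predicted label from a classify-style response dict."""
    if "top_class" in response:
        return response["top_class"]
    # Fall back to the highest-confidence entry in the classes list.
    best = max(response["classes"], key=lambda c: c["confidence"])
    return best["class_name"]


def accuracy(expected, predicted):
    """Fraction of positions where the predicted label matches the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


# Hand-written stand-ins for classifier responses (assumed shape).
responses = [
    {"top_class": "python",
     "classes": [{"class_name": "python", "confidence": 0.93},
                 {"class_name": "ruby", "confidence": 0.07}]},
    {"classes": [{"class_name": "java", "confidence": 0.4},
                 {"class_name": "javascript", "confidence": 0.6}]},
    {"top_class": "go",
     "classes": [{"class_name": "go", "confidence": 0.55},
                 {"class_name": "ruby", "confidence": 0.45}]},
]
expected = ["python", "javascript", "ruby"]
predicted = [top_class(r) for r in responses]
print(predicted, accuracy(expected, predicted))
```

The same accuracy helper can score the Naive Bayes predictions, which gives a like-for-like comparison between the two models on a held-out test set.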

  5. Add or change data set

The data used was generated using tools/getdata.ipynb. To analyze your own or another GitHub repository, use this notebook and export the data via HTTP. Then point to it in section 1.0 of notebooks/buildmodels.ipynb using wget.download().
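
If the wget package is not available in your environment, a similar download can be sketched with the Python standard library alone. This is an alternative under that assumption, not the notebook's own code. The helper name fetch and the example URL in the usage note are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, destination):
    """Download url to destination and return the destination path."""
    destination = Path(destination)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

A call would look like fetch("https://example.com/mydata.csv", "mydata.csv"), where the URL is a placeholder for wherever you exported your data.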

Sample output

To see the notebook with sample output, load examples/exampleNotebook.ipynb.
