Detect Program Type Using a Bag of Words (BoW) Model

In this project I use the Bag of Words concept to determine a file's program type, i.e. whether a program is written in Java or Python. Currently the code can distinguish between only these two program types. I am working on extending it to predict more program types.
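As a minimal sketch of the Bag of Words idea (not the project's code), the snippet below counts word-like tokens in a Java and a Python fragment and ignores their order. The helper name `bag_of_words` and the two sample fragments are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(source: str) -> Counter:
    """Count word-like tokens in the source, ignoring their order."""
    return Counter(re.findall(r"[a-z_]+", source.lower()))


# Hypothetical sample fragments, one per language.
java_bow = bag_of_words(
    "public static void main(String[] args) { System.out.println(1); }"
)
python_bow = bag_of_words("def main():\n    print(1)\n\nmain()")

# Words such as "public" appear only in the Java bag and "def" only in the
# Python bag, which is the kind of signal a classifier can learn from.
print(java_bow["public"], python_bow["def"], python_bow["public"])
```

The two bags share few words, and that difference in word counts is what lets a model separate the two languages.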


The following modules are used:

en_core_web_sm

os

pandas as pd

random

string

spacy

spacy.util.minibatch

re

How it works :

1. Data Collection :

For data collection, I downloaded sample programs for both languages, i.e. Java and Python.

2. Data Processing :

Keep the downloaded programs in a folder per file type inside the sample files folder.
Iterate over the folders using os.walk and read the content of each file.
Call the preprocess() function on the file content, which does the following:
a. Removes all punctuation, i.e. symbols like "-", ",", ".", etc.
b. Removes all stopwords like "a", "the", "is", "an", etc.
c. Lemmatizes the words, i.e. converts "running" to "run", "walking" to "walk", etc.
d. Returns the cleaned words as a list.
Collect all the file types present in the sample files and store them in a dictionary with two keys:
a. The file type or file extension, which acts as the label.
b. The cleaned content returned by the preprocess() function.
Create a TextCategorizer with exclusive classes and the "bow" architecture, and add the labels to it.
Train the model on the sample file contents. Loop over several epochs and re-shuffle the training data at the beginning of each loop. Depending on your computer's performance, this can take a while, possibly as long as 30 minutes.
Predict the file type of the test files. The test set contains three file types: "ipynb", "py", and "java".
Check the results of the prediction.
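The folder walk and cleaning steps above can be sketched roughly as follows. This is a standard-library approximation rather than the notebook's code. The stopword set is a tiny stand-in for spaCy's list, lemmatization is skipped because it needs a spaCy pipeline, and the `load_samples` helper and the list-of-dicts layout are assumptions made for illustration.

```python
import os
import re
import tempfile

# Tiny illustrative stopword list; an assumption standing in for spaCy's list.
STOPWORDS = {"a", "an", "the", "is"}


def preprocess(text):
    """Replace punctuation with spaces, lowercase, and drop stopwords.

    Lemmatization is left out here because it needs a spaCy pipeline.
    """
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return [word for word in no_punct.lower().split() if word not in STOPWORDS]


def load_samples(root):
    """Walk `root` and return one {"label", "content"} dict per file."""
    samples = []
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            # The file extension (without the dot) serves as the label.
            label = os.path.splitext(name)[1].lstrip(".")
            if not label:
                continue  # skip files without an extension
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                samples.append(
                    {"label": label, "content": preprocess(handle.read())}
                )
    return samples


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "py"))
    demo_path = os.path.join(root, "py", "hello.py")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("# this is a demo\nprint('hello')\n")
    demo_samples = load_samples(root)

print(demo_samples)
```

In the real project the cleaned content would then be handed to the spaCy TextCategorizer for training; that part is not reproduced here.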

Best Part:

The best part about this project is that the more sample program types you provide, the more file types it can detect. The sample files currently do not include .NET, Ruby, or other languages, but if you provide sample files for those languages as well, it will be able to predict their file types too. It's pretty dynamic.
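To illustrate why adding sample files for a new language is enough, here is a toy stand-in classifier in plain Python. It is not spaCy's TextCategorizer. It scores each label with add-one-smoothed word log-likelihoods, and the label set comes entirely from the training data. The names `train` and `predict` and the sample snippets are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(source):
    """Split source text into lowercase word-like tokens."""
    return re.findall(r"[a-z_]+", source.lower())


def train(samples):
    """Build per-label word counts from (label, source) pairs."""
    counts = defaultdict(Counter)
    for label, source in samples:
        counts[label].update(tokenize(source))
    return counts


def predict(counts, source):
    """Return the label with the highest smoothed log-likelihood."""
    vocab = {word for counter in counts.values() for word in counter}
    tokens = tokenize(source)

    def score(label):
        total = sum(counts[label].values()) + len(vocab)
        return sum(math.log((counts[label][t] + 1) / total) for t in tokens)

    return max(counts, key=score)


# Hypothetical training snippets for the two languages covered so far.
samples = [
    ("java", "public static void main(String[] args) { System.out.println(1); }"),
    ("py", "def main():\n    print(1)\n\nimport os"),
]
ruby_snippet = "def greet\n  puts 'hi'\nend"

# Without Ruby samples, the model can only answer with a label it knows.
before = predict(train(samples), ruby_snippet)

# Adding a Ruby sample introduces the new label with no code changes.
samples.append(("rb", "def greet\n  puts 'hi'\nend\nrequire 'json'"))
after = predict(train(samples), ruby_snippet)

print(before, after)
```

Before the Ruby sample is added, the snippet is assigned to one of the two known labels. Once the sample is present, the new label becomes available without any change to the code.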


amanaation/detect_program_type
