Py4JError: An error occurred while calling o9368.fit #14375

Open

NSManogna opened this issue Aug 20, 2024 · 4 comments

@NSManogna

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am training NerDLApproach for custom entities. When I increase the size of the training data, I get the error message "Py4JError: An error occurred while calling o9368.fit" and the connection is refused.

Current Behavior

I am getting the error message "Py4JError: An error occurred while calling o9368.fit" and the connection is refused.
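
A Py4JError followed by a refused connection during fit() often means the JVM driver process died, commonly from running out of memory as the training data grows. A minimal sketch of giving the driver more memory, assuming the default sparknlp.start() entry point; the "20G" value is illustrative, not a verified fix for this issue:

import sparknlp

# Hedged sketch: the memory argument sets the driver memory for the session
# created by sparknlp.start(); adjust to the instance's available RAM.
spark = sparknlp.start(memory="20G")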

Expected Behavior

Training should complete successfully, so that the trained model can then be used for NER on new text.

Steps To Reproduce

CoNll.zip

Spark NLP version and Apache Spark

I have launched JohnSnowLabs on an EC2 instance of type m5.2xlarge.

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

Spark NLP in JohnSnowLabs

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Please let me know if any further information is needed.

@maziyarpanahi
Member

Could you please provide the actual code you used to start the SparkSession and build the pipeline, so we can reproduce it?

@NSManogna
Author

The zip file I attached has an .ipynb file which contains the code.

@maziyarpanahi
Member

Please include the code here or on Google Colab. We are not allowed to download and open zip files for security reasons.

You just need to follow the template, nothing more and nothing less. The issue template is designed based on years of experience.

@NSManogna
Author

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk")

# Note: embeddings are computed over "document", while NerDLApproach below
# reads from "sentence".
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setPo(0.005) \
    .setBatchSize(8) \
    .setDropout(0.5) \
    .setValidationSplit(0.2)

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities")

c_pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    POSTag
])

import pandas as pd
import ast
from pyspark.sql.functions import explode, col

#df = spark.read.csv("pii_dataset.csv", header=True, inferSchema=True)

df=pd.read_csv("pii_dataset.csv")
#df = df.head(1000)

df1 = spark.createDataFrame(df)

f_model=c_pipeline.fit(df1)
result = f_model.transform(df1)

#result.select( explode(col("chunk.result")).alias("chunk_tag")).show(truncate=False)

df_new = df1.join(result.select("text", "pos.result"), on="text", how="left")
df_new = df_new.withColumnRenamed("result", "pos_tags")

#df_new1 = df_new.join(result.select("text", "chunk.result"), on="text", how="left")
#df_new1 = df_new1.withColumnRenamed("result", "chunks")

df_new2=df_new.toPandas()
df_new2['tokens'] = df_new2['tokens'].apply(ast.literal_eval)
df_new2['labels'] = df_new2['labels'].apply(ast.literal_eval)

selected_df=spark.createDataFrame(df_new2)
rows_as_dicts = selected_df.rdd.map(lambda row: row.asDict()).collect()

def convert_to_conll(sentences):
    conll_lines = []
    for sentence in sentences:
        tokens, labels, pos_tags = sentence['tokens'], sentence['labels'], sentence['pos_tags']
        for token, label, pos_tag in zip(tokens, labels, pos_tags):
            conll_lines.append(f"{token} {pos_tag} \t_ {label}")
        conll_lines.append("")  # Blank line to separate sentences
    return "\n".join(conll_lines)

conll_data = convert_to_conll(rows_as_dicts)

with open('annotations.conll', 'w') as file:
    file.write(conll_data)

print("Dataset converted to CoNLL format and saved as 'annotations.conll'.")
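
For reference, CoNLL().readDataset in Spark NLP parses the CoNLL-2003 layout: four space-separated columns per token (word, POS tag, chunk tag, NER label), with a blank line between sentences. An illustrative fragment (tokens and tags are made up):

John NNP B-NP B-PER
lives VBZ B-VP O
in IN B-PP O
Berlin NNP B-NP B-LOC

The f-string above emits "{token} {pos_tag} \t_ {label}" instead, i.e. a tab and an underscore in place of a proper chunk column; if the reader rejects or mis-parses that layout, it is worth normalizing to the four-column form first.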

nerpipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

from sparknlp.training import CoNLL

conll_instance = CoNLL()

training_data = conll_instance.readDataset(spark=spark, path='annotations.conll')

model = nerpipeline.fit(training_data)
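
Once fit() completes, the trained PipelineModel can be applied to new text, roughly as follows (a sketch; the sample sentence is made up and the column names follow the pipeline above):

# Hedged sketch: run the fitted pipeline over unseen text and inspect entities.
new_df = spark.createDataFrame([["John lives in Berlin."]]).toDF("text")
predictions = model.transform(new_df)
predictions.select("entities.result").show(truncate=False)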
