Py4JError: An error occurred while calling o9368.fit #14375

Open

NSManogna opened this issue Aug 20, 2024 · 4 comments

@NSManogna

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am training NerDLApproach for custom entities. When I increase the size of the training data, I get the error message "Py4JError: An error occurred while calling o9368.fit" and the connection is refused.

Current Behavior

I am getting the error message "Py4JError: An error occurred while calling o9368.fit" and the connection is refused.
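
A Py4JError followed by a refused connection during fit() often means the JVM driver process died, commonly from running out of memory as the training data grows. A minimal sketch of giving the driver more memory, assuming the default sparknlp.start() entry point; the "20G" value is illustrative, not a verified fix for this issue:

import sparknlp

# Hedged sketch: the memory argument sets the driver memory for the session
# created by sparknlp.start(); adjust to the instance's available RAM.
spark = sparknlp.start(memory="20G")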

Expected Behavior

Training should complete successfully, so that the trained model can then be used for NER on new text.

Steps To Reproduce

CoNll.zip

Spark NLP version and Apache Spark

I have launched JohnSnowLabs on an EC2 instance of type m5.2xlarge.

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

Spark NLP in JohnSnowLabs

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Please let me know if any further information is needed.

@maziyarpanahi
Member

Could you please provide the actual code you used to start the SparkSession and build the pipeline, so we can reproduce it?

@NSManogna
Author

The zip file I attached has an .ipynb file which contains the code.

@maziyarpanahi
Member

Please include the code here or on Google Colab. We are not allowed to download and open zip files for security reasons.

You just need to follow the template, nothing more and nothing less. The issue template is designed based on years of experience.

@NSManogna
Author

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk")

# Note: embeddings are computed over "document", while NerDLApproach below
# reads from "sentence".
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setPo(0.005) \
    .setBatchSize(8) \
    .setDropout(0.5) \
    .setValidationSplit(0.2)

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities")

c_pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    POSTag
])

import pandas as pd
import ast
from pyspark.sql.functions import explode, col

#df = spark.read.csv("pii_dataset.csv", header=True, inferSchema=True)

df=pd.read_csv("pii_dataset.csv")
#df = df.head(1000)

df1 = spark.createDataFrame(df)

f_model=c_pipeline.fit(df1)
result = f_model.transform(df1)

#result.select( explode(col("chunk.result")).alias("chunk_tag")).show(truncate=False)

df_new = df1.join(result.select("text", "pos.result"), on="text", how="left")
df_new = df_new.withColumnRenamed("result", "pos_tags")

#df_new1 = df_new.join(result.select("text", "chunk.result"), on="text", how="left")
#df_new1 = df_new1.withColumnRenamed("result", "chunks")

df_new2=df_new.toPandas()
df_new2['tokens'] = df_new2['tokens'].apply(ast.literal_eval)
df_new2['labels'] = df_new2['labels'].apply(ast.literal_eval)

selected_df=spark.createDataFrame(df_new2)
rows_as_dicts = selected_df.rdd.map(lambda row: row.asDict()).collect()

def convert_to_conll(sentences):
    conll_lines = []
    for sentence in sentences:
        tokens, labels, pos_tags = sentence['tokens'], sentence['labels'], sentence['pos_tags']
        for token, label, pos_tag in zip(tokens, labels, pos_tags):
            conll_lines.append(f"{token} {pos_tag} \t_ {label}")
        conll_lines.append("")  # Blank line to separate sentences
    return "\n".join(conll_lines)

conll_data = convert_to_conll(rows_as_dicts)

with open('annotations.conll', 'w') as file:
    file.write(conll_data)

print("Dataset converted to CoNLL format and saved as 'annotations.conll'.")
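
For reference, CoNLL().readDataset in Spark NLP parses the CoNLL-2003 layout: four space-separated columns per token (word, POS tag, chunk tag, NER label), with a blank line between sentences. An illustrative fragment (tokens and tags are made up):

John NNP B-NP B-PER
lives VBZ B-VP O
in IN B-PP O
Berlin NNP B-NP B-LOC

The f-string above emits "{token} {pos_tag} \t_ {label}" instead, i.e. a tab and an underscore in place of a proper chunk column; if the reader rejects or mis-parses that layout, it is worth normalizing to the four-column form first.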

nerpipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

from sparknlp.training import CoNLL

conll_instance = CoNLL()

training_data = conll_instance.readDataset(spark=spark, path='annotations.conll')

model = nerpipeline.fit(training_data)
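
Once fit() completes, the trained PipelineModel can be applied to new text, roughly as follows (a sketch; the sample sentence is made up and the column names follow the pipeline above):

# Hedged sketch: run the fitted pipeline over unseen text and inspect entities.
new_df = spark.createDataFrame([["John lives in Berlin."]]).toDF("text")
predictions = model.transform(new_df)
predictions.select("entities.result").show(truncate=False)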
