[R-235] Testset generation. ValueError: a cannot be empty unless no samples are taken - Not use adaptor #871

Open
JPonsa opened this issue Apr 15, 2024 · 16 comments
Labels: bug (Something isn't working), linear (Created by Linear-GitHub Sync)

JPonsa commented Apr 15, 2024

I am testing it with a single small document split into 5 chunks. I am repeating the test with 2 documents to see whether that could be the issue. This could be related to the issue below, but I am not using an adaptor. My text is in English, generated using LangChain's RecursiveJsonSplitter.

#625

Using ragas 0.1.7 on a Linux machine.

eval_ds = generator.generate_with_langchain_docs(
    docs,
    test_size=5,
    distributions={simple: 0.4, reasoning: 0.4, multi_context: 0.2},
)
Filename and doc_id are the same for all nodes.
Traceback (most recent call last):
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/./src/evaluation/RAGAS.py", line 191, in <module>
    main(args)
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/./src/evaluation/RAGAS.py", line 97, in main
    eval_ds = generator.generate_with_langchain_docs(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/ragas/testset/generator.py", line 179, in generate_with_langchain_docs
    return self.generate(
           ^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/ragas/testset/generator.py", line 248, in generate
    for n in self.docstore.get_random_nodes(k=test_size)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/ragas/testset/docstore.py", line 328, in get_random_nodes
    nodes = rng.choice(np.array(self.nodes), size=k, p=prob).tolist()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "numpy/random/_generator.pyx", line 803, in numpy.random._generator.Generator.choice
ValueError: a cannot be empty unless no samples are taken

R-235

JPonsa commented Apr 16, 2024

The problem persists with a large dataset.

Please see below an example of the documents:

[Document(page_content='{"NCT00000173": {"protocolSection": {"identificationModule": {"nctId": "NCT00000173", "organization": {"fullName": "National Institute on Aging (NIA)", "class": "NIH"}, "briefTitle": "Memory Impairment Study (Mild Cognitive Impairment Study)", "officialTitle": "A Randomized, Double-Blind, Placebo-Controlled Trial of Vitamin E and Donepezil HCL (Aricept) to Delay Clinical Progression From Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD)"}, "statusModule": {"overallStatus": "COMPLETED"}, "sponsorCollaboratorsModule": {"leadSponsor": {"name": "National Institute on Aging (NIA)", "class": "NIH"}}, "descriptionModule": {"briefSummary": "The National Institute on Aging (NIA) is launching a nationwide treatment study targeting individuals with mild cognitive impairment (MCI), a condition characterized by a memory deficit, but not dementia. An NIA-funded study recently confirmed that MCI is different from both dementia and normal age-related changes in memory. Accurate and early evaluation and treatment of MCI individuals might prevent further cognitive decline, including development of Alzheimer's disease (AD).\n\nThe Memory Impairment Study is the first such AD prevention clinical trial carried out by NIH, and will be conducted at 65-80 medical research institutions located in the United States and Canada. This study will test the usefulness of two drugs to slow or stop the conversion from MCI to AD. The trial will evaluate placebo, vitamin E, and donepezil, an investigational agent approved by the Food and Drug Administration for another use. 
Vitamin E (alpha-tocopherol) is thought to have antioxidant properties, and was shown in a 1997 study to delay important dementia milestones, such as patients' institutionalization or progression to severe dementia, by about seven months."}}}}', metadata={'filename': 'NCT00000173'}), Document(page_content='{"NCT00000173": {"protocolSection": {"descriptionModule": {"detailedDescription": "This clinical trial will be a multicenter, randomized, double-blind, placebo- controlled, parallel-group study of vitamin E and donepezil in 720 subjects with mild cognitive impairment (MCI). Subjects will be randomized to one of three treatment groups (240 subjects per treatment group): 1) Placebo vitamin E and placebo donepezil plus a multivitamin daily. 2) Vitamin E (2,000 I) and placebo donepezil plus a multivitamin daily.3) Donepezil (10 mg) and placebo vitamin E plus a multivitamin daily.\n\nThe study will be conducted over three years, with clinical evaluations every 3 months for the first 6 months and then every 6 months. Subjects randomized to donepezil will start a dose of 5 mg daily. Donepezil will be increased to 10 mg after six weeks. Subjects randomized to vitamin E will start at 1,000 I daily. The dose of Vitamin E will be increased to 2,000 I after six weeks. There will be a 12-month recruitment period. The primary endpoint will be time to development of Probable or Possible AD according to NINCDS-ADRDA criteria. Upon determination of a clinical diagnosis of AD, documentation will be sent to the ADCS Coordinating Center and forwarded to the Central Review Committee for verification. Upon verification, of conversion to diagnosis of AD, subjects will stop taking the donepezil study medication or its corresponding placebo, without breaking the blind, and will be offered open label donepezil at a scheduled visit one month after the prior diagnostic visit. 
Donepezil will be offered to subjects who convert to AD until the subject completes three years from the baseline visit. Based on an estimated incidence of AD of 15% per year, the study has 85% power to detect a 33% or greater reduction in conversion to AD over 3 years. Secondary outcome measures will include change on the Alzheimer's Disease Assessment Scale (ADAS-COG), the Neuropsychological Battery, the Mini-Mental State Exam (MMSE), Clinical Dementia Rating Scale (CDR), the Global Deterioration Scale (GDS), ADCS- Activities of Daily Living Inventory (ADCS-ADL), a Pharmacoeconomics scale, and a Quality of Life scale. Compliance will be monitored through the measurement of alpha-tocopherol levels and pill counts at each visit."}}}}', metadata={'filename': 'NCT00000173'})]

lucasmirachi commented Apr 16, 2024

I'm also getting the same error.

Ragas version: 0.1.7
Python 3.10.13
Linux Machine

I've tried to generate a Synthetic Test Set in two distinct situations: one using multiple documents (about 40) and one using a single small document.

testset = generator.generate_with_langchain_docs(documents, 10, distributions, raise_exceptions= False)

In both situations, I got the following error:

Filename and doc_id are the same for all nodes.
Traceback (most recent call last):
  File "/home/lucas/codecommit/module/test.py", line 55, in
    testset = generator.generate_with_langchain_docs(documents, 10, distributions, raise_exceptions= False)
  File "/home/lucas/.venv/lib/python3.10/site-packages/ragas/testset/generator.py", line 179, in generate_with_langchain_docs
    return self.generate(
  File "/home/lucas/.venv/lib/python3.10/site-packages/ragas/testset/generator.py", line 248, in generate
    for n in self.docstore.get_random_nodes(k=test_size)
  File "/home/lucas/.venv/lib/python3.10/site-packages/ragas/testset/docstore.py", line 328, in get_random_nodes
    nodes = rng.choice(np.array(self.nodes), size=k, p=prob).tolist()
  File "numpy/random/_generator.pyx", line 803, in numpy.random._generator.Generator.choice
ValueError: a cannot be empty unless no samples are taken

@jjmachan jjmachan added the linear Created by Linear-GitHub Sync label Apr 28, 2024
@jjmachan jjmachan changed the title Testset generation. ValueError: a cannot be empty unless no samples are taken - Not use adaptor [R-235] Testset generation. ValueError: a cannot be empty unless no samples are taken - Not use adaptor Apr 28, 2024
@jjmachan
Member

Hey @JPonsa @lucasmirachi, sorry about the late reply, but I'll get back once I've reproduced this (thanks for making it super easy 🙌🏽)

JPonsa commented May 4, 2024

@jjmachan have you been able to reproduce the issue? I am using publicly available data so happy to share the code with you in case you need to reproduce it.

@jjmachan jjmachan modified the milestones: v0.1.8, v.2 May 6, 2024
@jjmachan jjmachan modified the milestones: v.2, v.3, v0.1.8, v.4 May 13, 2024
@Prabhjot410

I am facing the same issue. Can you please tell me how you solved this? I am using a Hugging Face pipeline as the LLM.

@jjmachan jjmachan modified the milestones: v0.1.9, v.5 May 27, 2024
@jjmachan jjmachan modified the milestones: v.5, v.6 Jun 3, 2024
@SuroshAhmadZobair

same issue with llamaindex

@SuroshAhmadZobair

@jjmachan Hi
any updates?

@jjmachan jjmachan modified the milestones: v.6, v.7 Jun 10, 2024
@hshabbirh

@jjmachan I'm facing the same issue

@jjmachan jjmachan modified the milestones: v.7, v.8, v.9 Jun 17, 2024
@jjmachan jjmachan removed this from the v.9 milestone Jun 25, 2024
@Saurabh8255

same issue using model mistral7-b

@Rugved2204

Hi any update on this issue?

Saranyapramod66 commented Jul 28, 2024

Hi @jjmachan , Any update on this issue?

Traceback (most recent call last):
  File "C:\Users\SK_pr\Documents\test_set_generation.py", line 33, in <module>
    test_set = generator.generate_with_langchain_docs(documents, test_size=10, distributions=distributions,
  File "C:\Users\SK_pr\miniconda3\envs\env_rag\lib\site-packages\ragas\testset\generator.py", line 179, in generate_with_langchain_docs
    return self.generate(
  File "C:\Users\SK_pr\miniconda3\envs\env_rag\lib\site-packages\ragas\testset\generator.py", line 250, in generate
    print("Nodes: ",self.docstore.get_random_nodes(k=test_size))
  File "C:\Users\SK_pr\miniconda3\envs\env_rag\lib\site-packages\ragas\testset\docstore.py", line 331, in get_random_nodes
    nodes = rng.choice(np.array(self.nodes), size=k, p=prob).tolist()
  File "numpy\\random\\_generator.pyx", line 803, in numpy.random._generator.Generator.choice
ValueError: a cannot be empty unless no samples are taken

In the get_random_nodes method of docstore.py, self.nodes is an empty list, so 'prob' is empty as well, and both are passed to rng.choice. Can you please suggest a fix for this issue?
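
The numpy side of this diagnosis can be checked in isolation, without ragas installed. The sketch below is an illustration only, not ragas code: `sample_nodes` is a hypothetical helper, and the uniform `prob` array is an assumption standing in for whatever per-node weights the docstore computes. It reproduces the error with an empty array and shows one way to fail earlier with a more descriptive message.

```python
import numpy as np


def sample_nodes(nodes, k, seed=0):
    """Sample k items from nodes; fail early with a clear message if nodes is empty.

    Hypothetical helper for illustration, not part of ragas.
    """
    if len(nodes) == 0:
        raise ValueError(
            "docstore has no nodes to sample from; check that the documents "
            "produced at least one usable chunk"
        )
    rng = np.random.default_rng(seed)
    # Assumption: uniform weights stand in for the library's per-node probabilities.
    prob = np.full(len(nodes), 1.0 / len(nodes))
    return rng.choice(np.array(nodes, dtype=object), size=k, p=prob).tolist()


# Reproduce the failure from the traceback: empty population, size > 0.
try:
    np.random.default_rng(0).choice(np.array([]), size=5, p=np.array([]))
except ValueError as exc:
    numpy_message = str(exc)
    print(numpy_message)
```

Note that the guard has to run before `rng.choice` is called; a length check placed after the sampling call is never reached when the node list is empty.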

@jjmachan
Member

@shahules786 can you take a look at this issue?

@vitalyshalumov

Same problem with 0.1.13

@harveyphm

I have the same problem with version 0.1.13

@wanjeakshay

@dosu same issue for ragas Version: 0.1.20

dosubot bot commented Sep 19, 2024

The "ValueError: a cannot be empty unless no samples are taken" error occurs when the generate_with_langchain_docs method in the RAGAS library attempts to sample nodes from an empty list. This issue has been reported by multiple users and is related to how documents are processed or generated in your setup.

Similar Issues and Solutions

  1. Issue #625: This issue discusses the same error and was closed by a pull request that fixed the keyphrase extractor. However, the root cause in your case might be different.

  2. Issue #1109: This open issue suggests ensuring that the list of documents is not empty before passing it to the TestsetGenerator. You can add a check to verify that documents are loaded successfully:

    if not documents:
        raise ValueError("No documents were loaded. Please check your query or parameters.")

  3. Issue #1194: This issue also discusses the same error and suggests ensuring that the docstore contains enough nodes to match the test_size. You can add a check to handle cases where there are not enough nodes:

    current_nodes = [
        CurrentNodes(root_node=n, nodes=[n])
        for n in self.docstore.get_random_nodes(k=test_size)
    ]
    
    if len(current_nodes) < test_size:
        raise ValueError("Not enough nodes in the docstore to generate the test set.")

Suggested Solution

Ensure that the documents list is not empty before calling the generate_with_langchain_docs method. Here is an example of how to do this:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

documents = [...]  # Your list of LCDocument objects
if not documents:
    raise ValueError("The documents list cannot be empty.")

testset_generator = TestsetGenerator(...)  # construct with your own LLMs and embeddings
eval_ds = testset_generator.generate_with_langchain_docs(
    documents=documents,
    test_size=5,
    distributions={simple: 0.4, reasoning: 0.4, multi_context: 0.2},
    raise_exceptions=True,
)

By checking that the documents list is not empty before calling the method, you can avoid the "ValueError: a cannot be empty unless no samples are taken" error [1][2].
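
Several reporters in this thread did pass a non-empty list, so a bare `if not documents` check would not have caught their case. A slightly stricter pre-flight check is sketched below, assuming the only attribute needed is `page_content`. `Doc` is a hypothetical stand-in for a LangChain Document and `usable_documents` is a hypothetical helper. This only rules out empty or whitespace-only inputs; it does not address nodes being dropped later inside the docstore, which the thread has not confirmed as the cause either way.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def usable_documents(documents, min_chars=1):
    """Return the documents whose text has at least min_chars non-whitespace-trimmed characters."""
    kept = [d for d in documents if len(d.page_content.strip()) >= min_chars]
    if not kept:
        raise ValueError(
            "No usable documents: every page_content is empty or too short."
        )
    return kept


docs = [Doc("   "), Doc('{"nctId": "NCT00000173"}', {"filename": "NCT00000173"})]
checked = usable_documents(docs)
print(len(checked), "of", len(docs), "documents kept")
```

Passing `checked` instead of the raw list to the generator at least separates "my inputs were empty" from "the docstore discarded my chunks" when the error appears.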
